ELMo

About

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
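The snippet below is a minimal sketch of computing these representations through the AllenNLP integration described later on this page; the Elmo module and batch_to_ids helper are part of AllenNLP, while the options/weights paths are placeholders for the files linked in the pre-trained model table.

```python
from allennlp.modules.elmo import Elmo, batch_to_ids

# Placeholder paths: download an options/weights pair from the
# "Pre-trained ELMo Models" table and point these at the local files.
options_file = "elmo_options.json"
weight_file = "elmo_weights.hdf5"

# num_output_representations=1 requests a single weighted average of the biLM layers.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0)

# Input is tokenized text; batch_to_ids converts tokens to character ids,
# so there is no fixed word vocabulary.
sentences = [["The", "cat", "sat", "on", "the", "mat", "."],
             ["ELMo", "handles", "previously", "unseen", "words", "."]]
character_ids = batch_to_ids(sentences)

output = elmo(character_ids)
# A list with one tensor of shape (batch_size, max_sentence_length, embedding_dim);
# embedding_dim is 1024 for the Original model (2 x 512 output size).
embeddings = output["elmo_representations"][0]
```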

Salient features

ELMo representations are:

  • Contextual: The representation for each word depends on the entire context in which it is used.
  • Deep: The word representations combine all layers of a deep pre-trained neural network (a sketch of this layer weighting follows the list).
  • Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
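The "Deep" property corresponds to a task-specific weighted combination of the biLM layers: in the paper's notation, ELMo_k = gamma * sum_j s_j * h_(k,j), where the s_j are softmax-normalized weights and gamma is a learned scalar. Below is a minimal sketch of that mixing (not AllenNLP's actual ScalarMix implementation); the shapes and the three-layer setup of the Original model are assumptions.

```python
import torch

def scalar_mix(layer_activations: torch.Tensor,
               weights: torch.Tensor,
               gamma: torch.Tensor) -> torch.Tensor:
    """Weighted sum of biLM layers: gamma * sum_j softmax(weights)_j * layers_j.

    layer_activations: (num_layers, batch, seq_len, dim) hidden states.
    """
    s = torch.softmax(weights, dim=0)                      # s_j, one weight per layer
    mixed = (s.view(-1, 1, 1, 1) * layer_activations).sum(dim=0)
    return gamma * mixed                                   # gamma rescales the whole mix

# Toy usage: random activations standing in for the three layers of the
# Original model (character-CNN layer plus two biLSTM layers).
layers = torch.randn(3, 2, 7, 1024)
weights = torch.nn.Parameter(torch.zeros(3))               # learned per downstream task
gamma = torch.nn.Parameter(torch.ones(1))
elmo_vectors = scalar_mix(layers, weights, gamma)          # shape (2, 7, 1024)
```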

Key result

Adding ELMo to existing NLP systems significantly improves the state of the art for every task considered. In most cases, ELMo representations can simply be swapped in for pre-trained GloVe or other word vectors.

| Task | Previous SOTA | Our baseline | ELMo + Baseline | Increase (Absolute/Relative) |
|------|---------------|--------------|-----------------|------------------------------|
| SQuAD | SAN 84.4 | 81.1 | 85.8 | 4.7 / 24.9% |
| SNLI | Chen et al (2017) 88.6 | 88.0 | 88.7 +/- 0.17 | 0.7 / 5.8% |
| SRL | He et al (2017) 81.7 | 81.4 | 84.6 | 3.2 / 17.2% |
| Coref | Lee et al (2017) 67.2 | 67.2 | 70.4 | 3.2 / 9.8% |
| NER | Peters et al (2017) 91.93 +/- 0.19 | 90.15 | 92.22 +/- 0.10 | 2.06 / 21% |
| Sentiment (5-class) | McCann et al (2017) 53.7 | 51.4 | 54.7 +/- 0.5 | 3.3 / 6.8% |

Pre-trained ELMo Models

| Model | Link (Weights/Options File) | # Parameters (Millions) | LSTM Hidden Size/Output Size | # Highway Layers | SRL F1 | Constituency Parsing F1 |
|-------|-----------------------------|-------------------------|------------------------------|------------------|--------|-------------------------|
| Small | | 13.6 | 1024/128 | 1 | 83.62 | 93.12 |
| Medium | | 28.0 | 2048/256 | 1 | 84.04 | 93.60 |
| Original | | 93.6 | 4096/512 | 2 | 84.63 | 93.85 |
| Original (5.5B) | | 93.6 | 4096/512 | 2 | 84.93 | 94.01 |

The SRL baseline model is the one described in the original ELMo paper, and the constituency parser baseline is from Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Joshi et al., 2018). We do not include GloVe vectors in these models, so that the ELMo representations can be compared directly; in some cases this results in a small drop in performance (0.5 F1 for the Constituency Parser, > 0.1 F1 for the SRL model).

All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). In tasks where we have made a direct comparison, the 5.5B model has slightly higher performance than the original ELMo model, so we recommend it as a default model.

Contributed ELMo Models

ELMo models have also been trained for other languages and domains. We maintain a list of these contributed models here, but we are unable to address quality issues with them ourselves.

| Model | Link (Weights/Options File) | Contributor/Notes |
|-------|-----------------------------|-------------------|
| Portuguese (Wikipedia corpus) | | Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs. |
| Portuguese (brWaC corpus) | | Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs. |
| Japanese | | ExaWizards Inc. Enkhbold Bataa, Joshua Wu. (paper) |
| German | | Philip May & T-Systems onsite |
| Basque | | Stefan Schweter |
| PubMed | | Matthew Peters |
| Transformer ELMo | | Joel Grus and Brendan Roof |

Code releases and AllenNLP integration

Reference implementations of the pre-trained bidirectional language model are available in both PyTorch and TensorFlow. The PyTorch version is fully integrated into AllenNLP; the TensorFlow version is available in the bilm-tf repository.

Training models

You can retrain ELMo models using the TensorFlow code in bilm-tf.

More information

See our paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

Links