ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
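As a rough illustration of what "learned functions of the internal states" means, the sketch below collapses per-layer biLM states into one vector per token using softmax-normalized layer weights and a single scale factor, following the formulation in the ELMo paper. The function name `scalar_mix`, the toy numbers, and the tiny dimensions are assumptions made for illustration, not the reference implementation.

```python
import math

def scalar_mix(layer_states, layer_logits, gamma=1.0):
    """Mix biLM layers into one vector per token.

    layer_states: list of layers, each a list of token vectors
                  (shape: num_layers x num_tokens x dim).
    layer_logits: one unnormalized weight per layer (hypothetical values).
    gamma:        scalar that rescales the mixed vector.
    """
    # Softmax-normalize the layer weights so they are positive and sum to 1.
    top = max(layer_logits)
    exps = [math.exp(x - top) for x in layer_logits]
    total = sum(exps)
    weights = [e / total for e in exps]

    num_tokens = len(layer_states[0])
    dim = len(layer_states[0][0])
    mixed = []
    for t in range(num_tokens):
        # Weighted sum over layers for each dimension of token t.
        vec = [
            gamma * sum(w * layer[t][d] for w, layer in zip(weights, layer_states))
            for d in range(dim)
        ]
        mixed.append(vec)
    return mixed

# Toy input: 3 layers (token layer + 2 biLSTM layers), 2 tokens, 2 dimensions.
states = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[4.0, 5.0], [6.0, 7.0]],
    [[7.0, 8.0], [9.0, 10.0]],
]
# Equal logits give equal weights, so the result is the per-token layer average.
elmo_vectors = scalar_mix(states, layer_logits=[0.0, 0.0, 0.0], gamma=1.0)
print(elmo_vectors)
```

In a downstream model the layer weights and the scale factor would be trained together with the task, which is what lets each task emphasize different biLM layers.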

Salient features

ELMo representations are:

Key result

Adding ELMo to existing NLP systems significantly improves the state of the art for every task considered. In most cases, ELMo representations can simply be swapped in for pre-trained GloVe or other word vectors.

Task                | Previous SOTA (system) | Previous SOTA (score) | Our baseline | ELMo + Baseline | Increase (Absolute/Relative)
SQuAD               | SAN                    | 84.4                  | 81.1         | 85.8            | 4.7 / 24.9%
SNLI                | Chen et al (2017)      | 88.6                  | 88.0         | 88.7 +/- 0.17   | 0.7 / 5.8%
SRL                 | He et al (2017)        | 81.7                  | 81.4         | 84.6            | 3.2 / 17.2%
Coref               | Lee et al (2017)       |                       |              |                 | / 9.8%
NER                 | Peters et al (2017)    | 91.93 +/- 0.19        | 90.15        | 92.22 +/- 0.10  | 2.06 / 21%
Sentiment (5-class) | McCann et al (2017)    | 53.7                  | 51.4         | 54.7 +/- 0.5    | 3.3 / 6.8%
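The "Increase (Absolute/Relative)" figures in the table above are consistent with reading the relative number as a relative error reduction: the absolute gain divided by the error the baseline still makes. The sketch below reproduces the SQuAD row under that reading; the helper name `improvement` is hypothetical, and scores are assumed to be percentages out of 100.

```python
def improvement(baseline, with_elmo):
    """Return (absolute gain, relative error reduction) for two scores in percent."""
    absolute = with_elmo - baseline
    # Remaining error of the baseline, assuming a maximum score of 100.
    baseline_error = 100.0 - baseline
    relative = absolute / baseline_error
    return absolute, relative

# SQuAD row: baseline 81.1, ELMo + baseline 85.8.
abs_gain, rel_gain = improvement(81.1, 85.8)
print(f"{abs_gain:.1f} / {rel_gain:.1%}")  # 4.7 / 24.9%
```

The same calculation approximately matches the other complete rows; small differences can arise because the published scores are rounded.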

Pre-trained ELMo Models

Model           | Link (Weights/Options File) | # Parameters (Millions) | LSTM Hidden Size/Output Size | # Highway Layers | SRL F1 | Constituency Parsing F1
Small           | weights options             | 13.6                    | 1024/128                     | 1                | 83.62  | 93.12
Medium          | weights options             | 28.0                    | 2048/256                     | 1                | 84.04  | 93.60
Original        | weights options             | 93.6                    | 4096/512                     | 2                | 84.63  | 93.85
Original (5.5B) | weights options             | 93.6                    | 4096/512                     | 2                | 84.93  | 94.01

The baseline models described are from the original ELMo paper for SRL and from Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Joshi et al, 2018) for the Constituency Parser. We do not include GloVe vectors in these models to provide a direct comparison between ELMo representations - in some cases, this results in a small drop in performance (0.5 F1 for the Constituency Parser, > 0.1 for the SRL model).

All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). In tasks where we have made a direct comparison, the 5.5B model has slightly higher performance than the original ELMo model, so we recommend it as the default model.
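When swapping models, it helps to know how wide the resulting ELMo vectors are. The sketch below derives that width from the "LSTM Hidden Size/Output size" column of the model table, assuming that the forward and backward language model outputs are concatenated, so the ELMo dimension is twice the output size. The dictionary and function names are hypothetical.

```python
# (LSTM hidden size, output size) per model, copied from the model table.
MODELS = {
    "Small": (1024, 128),
    "Medium": (2048, 256),
    "Original": (4096, 512),
    "Original (5.5B)": (4096, 512),
}

def elmo_dim(output_size):
    """Width of one ELMo vector, assuming forward and backward outputs are concatenated."""
    return 2 * output_size

for name, (hidden_size, output_size) in MODELS.items():
    print(f"{name}: hidden={hidden_size}, ELMo dim={elmo_dim(output_size)}")
```

Under this assumption the Small, Medium, and Original models produce 256-, 512-, and 1024-dimensional vectors respectively, so any downstream layer that consumes ELMo needs its input size adjusted when the model changes.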

Code releases and AllenNLP integration

Reference implementations of the pre-trained bidirectional language model are available in both PyTorch and TensorFlow. The PyTorch version is fully integrated into AllenNLP, with a detailed tutorial available. The TensorFlow version is available in bilm-tf.

Training models

You can retrain ELMo models using the TensorFlow code in bilm-tf.

More information

See our paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.


  author={Peters, Matthew E. and  Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  title={Deep contextualized word representations},
  booktitle={Proc. of NAACL},