ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.
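For reference, the paper computes the ELMo vector for token $k$ as a task-specific weighted combination of the biLM layer representations $h_{k,j}^{LM}$, with softmax-normalized weights $s_j^{task}$ and a scalar $\gamma^{task}$ learned by the downstream task:

$$
\mathrm{ELMo}_k^{task} = \gamma^{task} \sum_{j=0}^{L} s_j^{task}\, h_{k,j}^{LM}
$$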
ELMo representations are:

- **Contextual**: the representation for each word depends on the entire context in which it is used.
- **Deep**: the word representations combine all layers of a deep pre-trained neural network.
- **Character based**: ELMo representations are purely character based, allowing the network to use morphological cues to form robust representations for out-of-vocabulary tokens unseen in training.
Adding ELMo to existing NLP systems significantly improves the state of the art for every task we considered. In most cases, ELMo representations can simply be swapped in for pre-trained GloVe or other word vectors.
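As a minimal sketch of that swap, the snippet below computes ELMo representations with AllenNLP's `Elmo` module and `batch_to_ids` (the 0.x API); the options and weights file paths are placeholders for the download links in the tables below.

```python
# Minimal sketch: computing ELMo representations with AllenNLP.
# The file paths below are placeholders; substitute the "weights" and
# "options" links from the pre-trained model table.
from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "elmo_options.json"   # placeholder path to an options file
weight_file = "elmo_weights.hdf5"    # placeholder path to a weights file

# num_output_representations=1 returns a single scalar-mixed layer; use a
# larger value to learn separate mixes for different parts of a task model.
elmo = Elmo(options_file, weight_file, num_output_representations=1, dropout=0.0)

sentences = [["First", "sentence", "."], ["Another", "one"]]
character_ids = batch_to_ids(sentences)   # shape: (batch, max_len, 50) character ids

embeddings = elmo(character_ids)
# embeddings["elmo_representations"] is a list with one tensor of shape
# (batch, max_len, 1024) for the original model; embeddings["mask"] marks padding.
```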
| Task | Previous SOTA | Score | Our baseline | ELMo + Baseline | Increase (Absolute / Relative) |
| --- | --- | --- | --- | --- | --- |
| SQuAD | SAN | 84.4 | 81.1 | 85.8 | 4.7 / 24.9% |
| SNLI | Chen et al. (2017) | 88.6 | 88.0 | 88.7 +/- 0.17 | 0.7 / 5.8% |
| SRL | He et al. (2017) | 81.7 | 81.4 | 84.6 | 3.2 / 17.2% |
| Coref | Lee et al. (2017) | 67.2 | 67.2 | 70.4 | 3.2 / 9.8% |
| NER | Peters et al. (2017) | 91.93 +/- 0.19 | 90.15 | 92.22 +/- 0.10 | 2.06 / 21% |
| Sentiment (5-class) | McCann et al. (2017) | 53.7 | 51.4 | 54.7 +/- 0.5 | 3.3 / 6.8% |
| Model | Weights | Options File | # Parameters (Millions) | LSTM Hidden Size / Output Size | # Highway Layers | SRL F1 | Constituency Parsing F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Small | weights | options | 13.6 | 1024 / 128 | 1 | 83.62 | 93.12 |
| Medium | weights | options | 28.0 | 2048 / 256 | 1 | 84.04 | 93.60 |
| Original | weights | options | 93.6 | 4096 / 512 | 2 | 84.63 | 93.85 |
| Original (5.5B) | weights | options | 93.6 | 4096 / 512 | 2 | 84.93 | 94.01 |
The baseline models described above are taken from the original ELMo paper for SRL and from Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Joshi et al., 2018) for the constituency parser. We do not include GloVe vectors in these models, in order to provide a direct comparison of the ELMo representations; in some cases this results in a small drop in performance (0.5 F1 for the constituency parser, > 0.1 for the SRL model).
All models except the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). On the tasks where we have made a direct comparison, the 5.5B model has slightly higher performance than the original ELMo model, so we recommend it as the default model.
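For quick experimentation with any of the pre-trained models above, a rough sketch using AllenNLP's `ElmoEmbedder` follows; the file paths are placeholders for the downloaded weights and options files (the small model is used purely as an example).

```python
# Rough sketch: embedding a sentence with a specific pre-trained ELMo model
# via AllenNLP's ElmoEmbedder (0.x API). The file paths are placeholders for
# the "weights" and "options" files linked in the table above.
from allennlp.commands.elmo import ElmoEmbedder

elmo = ElmoEmbedder(
    options_file="elmo_small_options.json",   # placeholder: small model options
    weight_file="elmo_small_weights.hdf5",    # placeholder: small model weights
)

tokens = ["The", "cat", "sat", "on", "the", "mat", "."]
vectors = elmo.embed_sentence(tokens)

# vectors has shape (3, len(tokens), 2 * output_size): one row per biLM layer.
# For the Small model (output size 128) that is (3, 7, 256); for the Original
# model (output size 512) it is (3, 7, 1024).
print(vectors.shape)
```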
| Model | Links (Weights / Options File) | Contributor / Notes |
| --- | --- | --- |
| Portuguese (Wikipedia corpus) | weights / options | Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs. |
| Portuguese (brWaC corpus) | weights / options | |
| Japanese | weights / options | ExaWizards Inc. Enkhbold Bataa, Joshua Wu. (paper) |
| German | code and weights | Philip May & T-Systems onsite |
| Basque | code and weights | Stefan Schweter |
| PubMed | weights / options | Matthew Peters |
| Transformer ELMo | model archive | Joel Grus and Brendan Roof |
There are reference implementations of the pre-trained bidirectional language model available in both PyTorch and TensorFlow. The PyTorch version is fully integrated into AllenNLP; the TensorFlow version is available in bilm-tf.
You can retrain ELMo models using the TensorFlow code in bilm-tf.
See our paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.
Citation:
@inproceedings{Peters:2018,
  author    = {Peters, Matthew E. and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke},
  title     = {Deep contextualized word representations},
  booktitle = {Proc. of NAACL},
  year      = {2018}
}