ELMo

About

ELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.

Salient features

ELMo representations are:

Contextual: The representation for each word depends on the entire context in which it is used.
Deep: The word representations combine all layers of a deep pre-trained neural network.
Character based: ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.
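
As an illustration of the "Deep" property above, the paper describes collapsing the per-layer biLM states for each token into one vector using softmax-normalized layer weights and a scalar scale factor. The snippet below is a minimal sketch of that weighted sum on toy numbers, not the AllenNLP or bilm-tf implementation; the function names `softmax` and `scalar_mix`, the layer values, and the weights are all assumptions chosen for illustration.

```python
import math


def softmax(weights):
    """Normalize raw layer weights so they are positive and sum to 1."""
    m = max(weights)  # subtract the max for numerical stability
    exps = [math.exp(w - m) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def scalar_mix(layers, raw_weights, gamma=1.0):
    """Combine per-layer vectors for one token into a single vector.

    layers      -- list of vectors (one per biLM layer), all the same length
    raw_weights -- one unnormalized weight per layer (learned per task in practice)
    gamma       -- scalar that rescales the whole mixed vector
    """
    if len(layers) != len(raw_weights):
        raise ValueError("need exactly one weight per layer")
    s = softmax(raw_weights)
    dim = len(layers[0])
    return [
        gamma * sum(s_j * layer[d] for s_j, layer in zip(s, layers))
        for d in range(dim)
    ]


# Toy example: three layers (token layer plus two LSTM layers), 4 dimensions.
layers = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0, 2.0],
]

# Equal raw weights give a plain average of the layers.
mixed = scalar_mix(layers, [0.0, 0.0, 0.0])
print(mixed)
```

In a real system the raw weights and the scale factor would be trained together with the downstream task model, so each task can emphasize different layers.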

Key result

Adding ELMo to existing NLP systems significantly improves the state of the art for every considered task. In most cases, ELMo representations can simply be swapped in for pre-trained GloVe or other word vectors.

Task	Previous SOTA (system)	Previous SOTA (score)	Our baseline	ELMo + Baseline	Increase (Absolute/Relative)
SQuAD	SAN	84.4	81.1	85.8	4.7 / 24.9%
SNLI	Chen et al. (2017)	88.6	88.0	88.7 +/- 0.17	0.7 / 5.8%
SRL	He et al. (2017)	81.7	81.4	84.6	3.2 / 17.2%
Coref	Lee et al. (2017)	67.2	67.2	70.4	3.2 / 9.8%
NER	Peters et al. (2017)	91.93 +/- 0.19	90.15	92.22 +/- 0.10	2.06 / 21%
Sentiment (5-class)	McCann et al. (2017)	53.7	51.4	54.7 +/- 0.5	3.3 / 6.8%
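
The relative figures in the last column are consistent with a relative error reduction: the absolute gain divided by the baseline's remaining error (100 minus the baseline score). The following sketch recomputes the column from the baseline and ELMo scores in the table; the function name `error_reduction` is hypothetical, and reading the column this way is an inference from the arithmetic rather than something the page states.

```python
def error_reduction(baseline, with_elmo):
    """Return (absolute gain, relative error reduction in percent).

    Scores are on a 0-100 scale, so the baseline error is 100 - baseline.
    """
    absolute = with_elmo - baseline
    relative = 100.0 * absolute / (100.0 - baseline)
    return absolute, relative


# (baseline, ELMo + baseline) pairs copied from the table above.
scores = {
    "SQuAD": (81.1, 85.8),
    "SNLI": (88.0, 88.7),
    "SRL": (81.4, 84.6),
    "Coref": (67.2, 70.4),
    "NER": (90.15, 92.22),
    "Sentiment (5-class)": (51.4, 54.7),
}

for task, (baseline, with_elmo) in scores.items():
    absolute, relative = error_reduction(baseline, with_elmo)
    print(f"{task}: +{absolute:.2f} absolute, {relative:.1f}% relative")
```

For SQuAD, for example, the gain of 4.7 points over a baseline error of 18.9 points gives roughly 24.9%, matching the table. For NER, the rounded scores give an absolute gain of 2.07 rather than the 2.06 in the table, which may simply reflect rounding of the underlying figures.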

Pre-trained ELMo Models

Model	Link (Weights/Options File)	# Parameters (Millions)	LSTM Hidden Size/Output size	# Highway Layers	SRL F1	Constituency Parsing F1
Small	weights options	13.6	1024/128	1	83.62	93.12
Medium	weights options	28.0	2048/256	1	84.04	93.60
Original	weights options	93.6	4096/512	2	84.63	93.85
Original (5.5B)	weights options	93.6	4096/512	2	84.93	94.01

The baseline models described are from the original ELMo paper for SRL and from Extending a Parser to Distant Domains Using a Few Dozen Partially Annotated Examples (Joshi et al., 2018) for the Constituency Parser. We do not include GloVe vectors in these models so that the ELMo representations can be compared directly; in some cases, this results in a small drop in performance (0.5 F1 for the Constituency Parser, > 0.1 for the SRL model).

All models except for the 5.5B model were trained on the 1 Billion Word Benchmark, approximately 800M tokens of news crawl data from WMT 2011. The ELMo 5.5B model was trained on a dataset of 5.5B tokens consisting of Wikipedia (1.9B) and all of the monolingual news crawl data from WMT 2008-2012 (3.6B). In tasks where we have made a direct comparison, the 5.5B model has slightly higher performance than the original ELMo model, so we recommend it as a default model.
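
To make the trade-offs in the model table easier to compare, the sketch below encodes its rows as plain Python data and picks the model with the best SRL F1 under a parameter budget. The names `ElmoModel`, `representation_dim`, and `pick_model` are hypothetical helpers, not part of AllenNLP or bilm-tf. The representation size is computed on the assumption that the forward and backward outputs are concatenated, giving twice the listed output size.

```python
from typing import NamedTuple


class ElmoModel(NamedTuple):
    name: str
    params_millions: float
    lstm_hidden: int
    output_size: int
    highway_layers: int
    srl_f1: float
    parsing_f1: float


# Rows copied from the pre-trained model table above.
MODELS = [
    ElmoModel("Small", 13.6, 1024, 128, 1, 83.62, 93.12),
    ElmoModel("Medium", 28.0, 2048, 256, 1, 84.04, 93.60),
    ElmoModel("Original", 93.6, 4096, 512, 2, 84.63, 93.85),
    ElmoModel("Original (5.5B)", 93.6, 4096, 512, 2, 84.93, 94.01),
]


def representation_dim(model):
    """Assumed size of one ELMo vector: forward + backward outputs concatenated."""
    return 2 * model.output_size


def pick_model(max_params_millions):
    """Return the model with the highest SRL F1 that fits the parameter budget."""
    candidates = [m for m in MODELS if m.params_millions <= max_params_millions]
    if not candidates:
        raise ValueError("no model fits the given parameter budget")
    return max(candidates, key=lambda m: m.srl_f1)


choice = pick_model(30.0)
print(choice.name, representation_dim(choice))
```

With no tight budget, the same selection lands on the 5.5B model, in line with the recommendation above.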

Contributed ELMo Models

ELMo models have been trained for other languages and domains. We maintain a list of models here but are unable to respond to quality issues ourselves.

Model	Link (Weights/Options File)	Contributor/Notes
Portuguese (Wikipedia corpus)	weights options	Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs.
Portuguese (brWaC corpus)	weights options	Federal University of Goiás (UFG). Pedro Vitor Quinta de Castro, Anderson da Silva Soares, Nádia Félix Felipe da Silva, Rafael Teixeira Sousa, Ayrton Denner da Silva Amaral. Sponsored by Data-H, Aviso Urgente, and Americas Health Labs.
Japanese	weights options	ExaWizards Inc. Enkhbold Bataa, Joshua Wu. (paper)
German	code and weights	Philip May & T-Systems onsite
Basque	code and weights	Stefan Schweter
PubMed	weights options	Matthew Peters
Transformer ELMo	model archive	Joel Grus and Brendan Roof

Code releases and AllenNLP integration

Reference implementations of the pre-trained bidirectional language model are available in both PyTorch and TensorFlow. The PyTorch version is fully integrated into AllenNLP. The TensorFlow version is available in bilm-tf.

Training models

You can retrain ELMo models using the TensorFlow code in bilm-tf.

More information

See our paper Deep contextualized word representations for more information about the algorithm and a detailed analysis.

Links

GitHub

View ELMo on GitHub

Docs

View ELMo docs

PyPI

View ELMo on PyPI
