Welcome to AllenNLP! This tutorial will walk you through the basics of building and training an AllenNLP model.

Looking for something else?  Check out our other tutorials.

Before we get started, make sure you have a clean Python 3.6 or 3.7 virtual environment, and then run the following command to install the AllenNLP library:

pip install allennlp

In this tutorial we'll implement a slightly enhanced version of the PyTorch LSTM for Part-of-Speech Tagging tutorial, adding some features that make the task more realistic (and that also showcase some of the benefits of AllenNLP):

  1. We'll read our data from files. (The tutorial example uses data that's given as part of the Python code.)
  2. We'll use a separate validation dataset to check our performance. (The tutorial example trains and evaluates on the same dataset.)
  3. We'll use tqdm to track the progress of our training.
  4. We'll implement early stopping based on the loss on the validation dataset.
  5. We'll track accuracy on both the training and validation sets as we train the model.

(In addition to what's highlighted in this tutorial, AllenNLP provides many other "for free" features.)
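Early stopping on validation loss (item 4 above) is easy to picture in isolation. The following is a minimal sketch in plain Python, not AllenNLP's Trainer code: the function name `should_stop_early` and its `patience` parameter are illustrative assumptions, and the loss values are made up for the example.

```python
from typing import List


def should_stop_early(validation_losses: List[float], patience: int) -> bool:
    """Return True if the last `patience` epochs brought no new best loss.

    Hypothetical helper for illustration; `validation_losses` holds one
    value per completed epoch, oldest first.
    """
    if len(validation_losses) <= patience:
        # Not enough history yet to have waited `patience` epochs.
        return False
    best_before = min(validation_losses[:-patience])
    recent_best = min(validation_losses[-patience:])
    # Stop only if none of the recent epochs beat the earlier best.
    return recent_best >= best_before


# Illustrative (made-up) losses: the best value, 0.5, is reached at the third epoch.
losses = [0.9, 0.6, 0.5, 0.55, 0.52, 0.51]
print(should_stop_early(losses, patience=3))  # True: three epochs without a new best
print(should_stop_early(losses, patience=4))  # False: 0.5 falls inside the last four epochs
```

A larger patience tolerates longer plateaus at the cost of extra epochs.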

The Problem

Given a sentence (e.g. "The dog ate the apple") we want to predict a part-of-speech tag for each word
(e.g. ["DET", "NN", "V", "DET", "NN"]).

As in the PyTorch tutorial, we'll embed each word in a low-dimensional space, pass the embeddings through an LSTM to get a sequence of encodings, and use a feedforward layer to transform those into a sequence of logits (corresponding to the possible part-of-speech tags).
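To make the tensor shapes concrete, here is a rough sketch of that embed, encode, project pipeline in plain PyTorch, with no AllenNLP involved. The sizes and word ids are hypothetical values chosen for illustration, and the weights are random, so the logits themselves carry no meaning.

```python
import torch

torch.manual_seed(0)  # only so repeated runs use the same random weights

# Hypothetical sizes for illustration; none of these come from the tutorial.
VOCAB_SIZE, NUM_TAGS, EMBED_DIM, HIDDEN = 10, 3, 6, 6

embedding = torch.nn.Embedding(VOCAB_SIZE, EMBED_DIM)
lstm = torch.nn.LSTM(EMBED_DIM, HIDDEN, batch_first=True)
hidden2tag = torch.nn.Linear(HIDDEN, NUM_TAGS)

# One sentence of five made-up word ids, shape (batch=1, seq_len=5).
word_ids = torch.tensor([[0, 1, 2, 0, 3]])

embedded = embedding(word_ids)      # (1, 5, EMBED_DIM)
encodings, _ = lstm(embedded)       # (1, 5, HIDDEN)
tag_logits = hidden2tag(encodings)  # (1, 5, NUM_TAGS): one logit per tag per word

print(tag_logits.shape)
```

Each of the five positions ends up with one logit per tag, which is the per-word output the tagger is trained on.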

Below is the annotated code for accomplishing this. You can start reading the annotations from the top, or just look through the code and look to the annotations when you need more explanation.

from typing import Iterator, List, Dict
import torch
import torch.optim as optim
import numpy as np
from allennlp.data import Instance
from allennlp.data.fields import TextField, SequenceLabelField
from allennlp.data.dataset_readers import DatasetReader
from allennlp.common.file_utils import cached_path
from allennlp.data.token_indexers import TokenIndexer, SingleIdTokenIndexer
from allennlp.data.tokenizers import Token
from allennlp.data.vocabulary import Vocabulary
from allennlp.models import Model
from allennlp.modules.text_field_embedders import TextFieldEmbedder, BasicTextFieldEmbedder
from allennlp.modules.token_embedders import Embedding
from allennlp.modules.seq2seq_encoders import Seq2SeqEncoder, PytorchSeq2SeqWrapper
from allennlp.nn.util import get_text_field_mask, sequence_cross_entropy_with_logits
from allennlp.training.metrics import CategoricalAccuracy
from allennlp.data.iterators import BucketIterator
from allennlp.training.trainer import Trainer
from allennlp.predictors import SentenceTaggerPredictor

class PosDatasetReader(DatasetReader):
    """
    DatasetReader for PoS tagging data, one sentence per line, like

        The###DET dog###NN ate###V the###DET apple###NN
    """
    def __init__(self, token_indexers: Dict[str, TokenIndexer] = None) -> None:
        # Read the whole dataset eagerly rather than lazily.
        super().__init__(lazy=False)
        self.token_indexers = token_indexers or {"tokens": SingleIdTokenIndexer()}

    def text_to_instance(self, tokens: List[Token], tags: List[str] = None) -> Instance:
        sentence_field = TextField(tokens, self.token_indexers)
        fields = {"sentence": sentence_field}

        # Tags are optional so that unlabeled sentences can be turned into instances too.
        if tags:
            label_field = SequenceLabelField(labels=tags, sequence_field=sentence_field)
            fields["labels"] = label_field

        return Instance(fields)

    def _read(self, file_path: str) -> Iterator[Instance]:
        with open(file_path) as f:
            for line in f:
                # Each whitespace-separated item looks like word###TAG.
                pairs = line.strip().split()
                sentence, tags = zip(*(pair.split("###") for pair in pairs))
                yield self.text_to_instance([Token(word) for word in sentence], tags)

class LstmTagger(Model):
    def __init__(self,
                 word_embeddings: TextFieldEmbedder,
                 encoder: Seq2SeqEncoder,
                 vocab: Vocabulary) -> None:
        # The base Model constructor needs the vocabulary.
        super().__init__(vocab)
        self.word_embeddings = word_embeddings
        self.encoder = encoder
        # Map each encoder output to one logit per part-of-speech tag.
        self.hidden2tag = torch.nn.Linear(in_features=encoder.get_output_dim(),
                                          out_features=vocab.get_vocab_size('labels'))
        self.accuracy = CategoricalAccuracy()

    def forward(self,
                sentence: Dict[str, torch.Tensor],
                labels: torch.Tensor = None) -> Dict[str, torch.Tensor]:
        mask = get_text_field_mask(sentence)
        embeddings = self.word_embeddings(sentence)
        encoder_out = self.encoder(embeddings, mask)
        tag_logits = self.hidden2tag(encoder_out)
        output = {"tag_logits": tag_logits}
        # Labels are absent at prediction time, so the loss is only computed when they are given.
        if labels is not None:
            self.accuracy(tag_logits, labels, mask)
            output["loss"] = sequence_cross_entropy_with_logits(tag_logits, labels, mask)

        return output

    def get_metrics(self, reset: bool = False) -> Dict[str, float]:
        return {"accuracy": self.accuracy.get_metric(reset)}

reader = PosDatasetReader()
train_dataset = reader.read(cached_path(
validation_dataset = reader.read(cached_path(
vocab = Vocabulary.from_instances(train_dataset + validation_dataset)
token_embedding = Embedding(num_embeddings=vocab.get_vocab_size('tokens'),
                            embedding_dim=EMBEDDING_DIM)
word_embeddings = BasicTextFieldEmbedder({"tokens": token_embedding})
lstm = PytorchSeq2SeqWrapper(torch.nn.LSTM(EMBEDDING_DIM, HIDDEN_DIM, batch_first=True))
model = LstmTagger(word_embeddings, lstm, vocab)
optimizer = optim.SGD(model.parameters(), lr=0.1)
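iterator = BucketIterator(batch_size=2, sorting_keys=[("sentence", "num_tokens")])
iterator.index_with(vocab)
trainer = Trainer(model=model,
                  optimizer=optimizer,
                  iterator=iterator,
                  train_dataset=train_dataset,
                  validation_dataset=validation_dataset,
                  patience=10,
                  num_epochs=1000)
trainer.train()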
predictor = SentenceTaggerPredictor(model, dataset_reader=reader)
tag_logits = predictor.predict("The dog ate the apple")['tag_logits']
tag_ids = np.argmax(tag_logits, axis=-1)
print([model.vocab.get_token_from_index(i, 'labels') for i in tag_ids])
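The parsing step inside `_read` can be tried on its own, without installing anything. The sketch below is plain Python; `split_tagged_line` is a hypothetical helper written for this illustration (it is not part of AllenNLP), and it returns plain strings where `_read` would build an Instance.

```python
from typing import List, Tuple


def split_tagged_line(line: str) -> Tuple[List[str], List[str]]:
    """Split 'word###TAG word###TAG ...' into parallel word and tag lists.

    Hypothetical helper for illustration; it mirrors the two parsing lines
    in `_read` but returns plain strings instead of an Instance.
    """
    pairs = line.strip().split()
    words, tags = zip(*(pair.split("###") for pair in pairs))
    return list(words), list(tags)


words, tags = split_tagged_line("The###DET dog###NN ate###V the###DET apple###NN\n")
print(words)  # ['The', 'dog', 'ate', 'the', 'apple']
print(tags)   # ['DET', 'NN', 'V', 'DET', 'NN']
```

Note that a blank line yields no pairs, so the two-way unpacking would raise a ValueError; data files with empty lines would need a guard before this step.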
  • In AllenNLP we use type annotations for just about everything.
  • AllenNLP is built on top of PyTorch, so we use its code freely.
  • In AllenNLP we represent each training example as an Instance containing Fields of various types. Here each example will have a TextField containing the sentence, and a SequenceLabelField containing the corresponding part-of-speech tags.
  • Typically to solve a problem like this using AllenNLP, you'll have to implement two classes. The first is a DatasetReader, which contains the logic for reading a file of data and producing a stream of Instances.
  • Frequently we'll want to load datasets or models from URLs. The cached_path helper downloads such files, caches them locally, and returns the local path. It also accepts local file paths (which it just returns as-is).
  • There are various ways to represent a word as one or more indices. For example, you might maintain a vocabulary of unique words and give each word a corresponding id. Or you might have one id per character in the word and represent each word as a sequence of ids. AllenNLP uses a TokenIndexer abstraction for this representation.
  • Whereas a TokenIndexer represents a rule for how to turn a token into indices, a Vocabulary contains the corresponding mappings from strings to integers. For example, your token indexer might specify to represent a token as a sequence of character ids, in which case the Vocabulary would contain the mapping {character -> id}. In this particular example we use a SingleIdTokenIndexer that assigns each token a unique id, and so the Vocabulary will just contain a mapping {token -> id} (as well as the reverse mapping).
  • Besides DatasetReader, the other class you'll typically need to implement is Model, which is a PyTorch Module that takes tensor inputs and produces a dict of tensor outputs (including the training loss you want to optimize).
  • As mentioned above, our model will consist of an embedding layer, followed by an LSTM, then by a feedforward layer. AllenNLP includes abstractions for all of these that smartly handle padding and batching, as well as various utility functions.
  • We'll want to track accuracy on the training and validation datasets.
  • In our training we'll need a DataIterator that can intelligently batch our data.
  • And we'll use AllenNLP's full-featured Trainer.
  • Finally, we'll want to make predictions on new inputs; more about this below.
  • Our first order of business is to implement our DatasetReader subclass.
  • The only parameter our DatasetReader needs is a dict of TokenIndexers that specify how to convert tokens into indices. By default we'll just generate a single index for each token (which we'll call "tokens") that's just a unique id for each distinct token. (This is just the standard "word to index" mapping you'd use in most NLP tasks.)
  • DatasetReader.text_to_instance takes the inputs corresponding to a training example (in this case the tokens of the sentence and the corresponding part-of-speech tags), instantiates the corresponding Fields (in this case a TextField for the sentence and a SequenceLabelField for its tags), and returns the Instance containing those fields. Notice that the tags are optional, since we'd like to be able to create instances from unlabeled data to make predictions on them.
  • The other piece we have to implement is _read, which takes a filename and produces a stream of Instances. Most of the work has already been done in text_to_instance.
  • The other class you'll basically always have to implement is Model, which is a subclass of torch.nn.Module. How it works is largely up to you; it mostly just needs a forward method that takes tensor inputs and produces a dict of tensor outputs that includes the loss you'll use to train the model. As mentioned above, our model will consist of an embedding layer, a sequence encoder, and a feedforward network.
  • One thing that might seem unusual is that we're going to pass in the embedder and the sequence encoder as constructor parameters. This allows us to experiment with different embedders and encoders without having to change the model code.
  • The embedding layer is specified as an AllenNLP TextFieldEmbedder which represents a general way of turning tokens into tensors. (Here we know that we want to represent each unique word with a learned tensor, but using the general class allows us to easily experiment with different types of embeddings, for example ELMo.)
  • Similarly, the encoder is specified as a general Seq2SeqEncoder even though we know we want to use an LSTM. Again, this makes it easy to experiment with other sequence encoders, for example a Transformer.
  • Every AllenNLP model also expects a Vocabulary, which contains the namespaced mappings of tokens to indices and labels to indices.
  • Notice that we have to pass the vocab to the base class constructor.
  • The feedforward layer is not passed in as a parameter, but is constructed by us. Notice that it looks at the encoder to find the correct input dimension and looks at the vocabulary (and, in particular, at the label -> index mapping) to find the correct output dimension.
  • The last thing to notice is that we also instantiate a CategoricalAccuracy metric, which we'll use to track accuracy during each training and validation epoch.
  • Next we need to implement forward, which is where the actual computation happens. Each Instance in your dataset will get (batched with other instances and) fed into forward. The forward method expects dicts of tensors as input, and it expects their names to be the names of the fields in your Instance. In this case we have a sentence field and (possibly) a labels field, so we'll construct our forward accordingly:
  • AllenNLP is designed to operate on batched inputs, but different input sequences have different lengths. Behind the scenes AllenNLP is padding the shorter inputs so that the batch has uniform shape, which means our computations need to use a mask to exclude the padding. Here we just use the utility function get_text_field_mask, which returns a tensor of 0s and 1s corresponding to the padded and unpadded locations.
  • We start by passing the sentence tensor (each sentence a sequence of token ids) to the word_embeddings module, which converts each sentence into a sequence of embedded tensors.
  • We next pass the embedded tensors (and the mask) to the LSTM, which produces a sequence of encoded outputs.
  • Finally, we pass each encoded output tensor to the feedforward layer to produce logits corresponding to the various tags.
  • As before, the labels were optional, as we might want to run this model to make predictions on unlabeled data. If we do have labels, then we use them to update our accuracy metric and compute the "loss" that goes in our output.
  • We included an accuracy metric that gets updated each forward pass. That means we need to override a get_metrics method that pulls the data out of it. Behind the scenes, the CategoricalAccuracy metric is storing the number of predictions and the number of correct predictions, updating those counts during each call to forward. Each call to get_metric returns the calculated accuracy and (optionally) resets the counts, which is what allows us to track accuracy anew for each epoch.
  • Now that we've implemented a DatasetReader and Model, we're ready to train. We first need an instance of our dataset reader.
  • We then use the reader to read in the training data and validation data. Here we read them in from a URL, but you could read them in from local files if your data was local. We use cached_path to cache the files locally (and to hand reader.read the path to the local cached version).
  • Once we've read in the datasets, we use them to create our Vocabulary (that is, the mapping[s] from tokens / labels to ids).
  • Now we need to construct the model. We'll choose a size for our embedding layer and for the hidden layer of our LSTM.
  • For embedding the tokens we'll just use the BasicTextFieldEmbedder which takes a mapping from index names to embeddings. If you go back to where we defined our DatasetReader, the default parameters included a single index called "tokens", so our mapping just needs an embedding corresponding to that index. We use the Vocabulary to find how many embeddings we need and our EMBEDDING_DIM parameter to specify the output dimension. It's also possible to start with pre-trained embeddings (for example, GloVe vectors), but there's no need to do that on this tiny toy dataset.
  • We next need to specify the sequence encoder. The need for PytorchSeq2SeqWrapper here is slightly unfortunate (and if you use configuration files you won't need to worry about it) but here it's required to add some extra functionality (and a cleaner interface) to the built-in PyTorch module. In AllenNLP we do everything batch first, so we specify that as well.
  • Finally, we can instantiate the model.
  • Now we're ready to train the model. The first thing we'll need is an optimizer. We can just use PyTorch's stochastic gradient descent.
  • And we need a DataIterator that handles batching for our datasets. The BucketIterator sorts instances by the specified fields in order to create batches with similar sequence lengths. Here we indicate that we want to sort the instances by the number of tokens in the sentence field.
  • We also specify that the iterator should make sure its instances are indexed using our vocabulary; that is, that their strings have been converted to integers using the mapping we previously created.
  • Now we instantiate our Trainer and run it. Here we tell it to run for 1000 epochs and to stop training early if it ever spends 10 epochs without the validation metric improving. The default validation metric is loss (which improves by getting smaller), but it's also possible to specify a different metric and direction (e.g. accuracy should get bigger).
  • When we launch it, it will print a progress bar for each epoch that includes both the "loss" and the "accuracy" metric. If our model is good, the loss should go down and the accuracy up as we train.
  • As in the original PyTorch tutorial, we'd like to look at the predictions our model generates. AllenNLP contains a Predictor abstraction that takes inputs, converts them to instances, feeds them through your model, and returns JSON-serializable results. Often you'd need to implement your own Predictor, but AllenNLP already has a SentenceTaggerPredictor that works perfectly here, so we can use it. It requires our model (for making predictions) and a dataset reader (for creating instances).
  • It has a predict method that just needs a sentence and returns (a JSON-serializable version of) the output dict from forward. Here tag_logits will be a (5, 3) array of logits, corresponding to the 3 possible tags for each of the 5 words.
  • To get the actual "predictions" we can just take the argmax.
  • And then use our vocabulary to find the predicted tags.
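The padding-and-mask idea described in the annotations can be illustrated without any tensors. This sketch is plain Python and is not how AllenNLP implements get_text_field_mask or CategoricalAccuracy; the helper names `pad_batch` and `masked_accuracy`, the padding id 0, and the toy tag ids are all assumptions made for the example.

```python
from typing import List, Tuple


def pad_batch(sequences: List[List[int]],
              pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence to the longest length and build a 0/1 mask."""
    max_len = max(len(seq) for seq in sequences)
    padded = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


def masked_accuracy(predicted: List[List[int]],
                    gold: List[List[int]],
                    mask: List[List[int]]) -> float:
    """Fraction of correct predictions, counting only unmasked positions."""
    correct = total = 0
    for pred_row, gold_row, mask_row in zip(predicted, gold, mask):
        for p, g, m in zip(pred_row, gold_row, mask_row):
            if m:
                total += 1
                correct += int(p == g)
    return correct / total if total else 0.0


# Toy tag ids for two sentences of lengths 3 and 2 (made-up values).
gold, mask = pad_batch([[1, 2, 1], [2, 1]])
predicted = [[1, 2, 2], [2, 1, 2]]  # the final 2 sits on a padded position

print(mask)                                    # [[1, 1, 1], [1, 1, 0]]
print(masked_accuracy(predicted, gold, mask))  # 0.8
```

Without the mask, the prediction on the padded slot would count as a sixth (wrong) answer and drag the accuracy down, which is why both the metric and the loss receive the mask in forward.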