Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce MOCHA (MOdeling Correctness with Human Annotations), a benchmark for training and evaluating generative reading comprehension metrics. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets, plus an additional set of minimal pairs for evaluation. Using MOCHA, we train LERC, a Learned Evaluation metric for Reading Comprehension, to mimic human judgement scores.
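To see why token-overlap metrics can mislead, here is a minimal sketch of a SQuAD-style token F1 score (a simplified stand-in for the overlap metrics mentioned above, not the MOCHA or LERC code): a paraphrased but correct answer can score lower than a fluent but incorrect answer that happens to share words with the reference.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "because he felt scared"
# A correct paraphrase shares only one token with the reference ...
print(token_f1("he was afraid", reference))
# ... while a wrong answer sharing two tokens scores higher.
print(token_f1("he felt happy", reference))
```

A learned metric like LERC is trained on human judgements precisely to avoid this failure mode, scoring the paraphrase as correct regardless of surface overlap.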

Find out more in the links below.

  • Paper: EMNLP 2020 paper describing MOCHA and LERC.
  • Data: MOCHA contains ~40K instances split into train, validation, and test sets. It is distributed under the CC BY-SA 4.0 license.
  • Code: Coming soon! This will include code for reproducing LERC and an evaluation script. We will also be providing a trained version of LERC to be used for evaluation. The code base heavily relies on PyTorch, HuggingFace Transformers, and AllenNLP.
  • Leaderboard: Coming soon!
  • Demo: Coming soon! You'll be able to see how well a learned metric evaluates generated answers in comparison to other metrics like BLEU, METEOR, and BERTScore. The examples should give you some sense of what kinds of questions are in MOCHA, and what LERC can and cannot currently handle. If you find something interesting, let us know on Twitter!
  • Citation:

                @inproceedings{chen2020mocha,
                author={Anthony Chen and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
                title={MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics},
                booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
                year={2020}
                }