What are Contrast Sets?

Standard test sets for supervised learning evaluate in-distribution generalization. Unfortunately, when a dataset has systematic gaps (e.g., annotation artifacts), these evaluations are misleading: a model can learn simple decision rules that perform well on the test set but do not capture a dataset's intended capabilities. We propose a new annotation paradigm for NLP that helps to close systematic gaps in the test data. In particular, after a dataset is constructed, we recommend that the dataset authors manually perturb the test instances in small but meaningful ways that (typically) change the gold label, creating contrast sets. Contrast sets provide a local view of a model's decision boundary, which can be used to more accurately evaluate a model's true linguistic capabilities. We demonstrate the efficacy of contrast sets by creating them for 10 diverse NLP datasets (e.g., DROP reading comprehension, UD parsing, IMDb sentiment analysis). Although our contrast sets are not explicitly adversarial, model performance is significantly lower on them than on the original test sets---up to 25% in some cases. We release our contrast sets as new evaluation benchmarks and encourage future dataset construction efforts to follow similar annotation processes.
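The evaluation idea above can be made concrete with a small sketch. Grouping each original test instance with its perturbed variants lets us compute both ordinary accuracy and a stricter set-level score (credit only when the model is correct on every instance in a contrast set). The `predict` function and the data format here are hypothetical stand-ins, not the paper's actual code.

```python
# Sketch of contrast-set evaluation. `predict` is a stand-in for any
# trained model's prediction function; the data format is hypothetical.

def contrast_set_metrics(contrast_sets, predict):
    """Each contrast set is a list of (input, gold_label) pairs:
    the original test instance plus its perturbed variants."""
    total, correct, consistent_sets = 0, 0, 0
    for cset in contrast_sets:
        all_right = True
        for text, gold in cset:
            ok = predict(text) == gold
            correct += ok
            total += 1
            all_right &= ok
        consistent_sets += all_right
    accuracy = correct / total                      # per-instance accuracy
    consistency = consistent_sets / len(contrast_sets)  # all-correct sets
    return accuracy, consistency

# Toy example: a brittle "sentiment model" that keys on a single word
# performs well on the original instance but fails its perturbation.
predict = lambda s: "positive" if "great" in s else "negative"
sets = [[("A great film.", "positive"),
         ("A far from great film.", "negative")]]
print(contrast_set_metrics(sets, predict))  # → (0.5, 0.0)
```

The toy model illustrates the gap the paper measures: a simple decision rule scores 50% on the instances but 0% on the set-level metric, because the small perturbation flips the gold label without flipping the model's prediction.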

Here is the paper.

Individual Datasets

Dataset | Contrast Sets | Type of NLP Task
BoolQ (Clark et al., 2019) | Data | Reading Comprehension
DROP (Dua et al., 2019) | Data | Reading Comprehension
MC-TACO (Zhou et al., 2019) | Data | Reading Comprehension
ROPES (Lin et al., 2019) | Data | Reading Comprehension
Quoref (Dasigi et al., 2019) | Data | Reading Comprehension
IMDb Sentiment Analysis (Maas et al., 2011) | Data | Classification
MATRES (Ning et al., 2018) | Data | Classification
NLVR2 (Suhr et al., 2019) | Data | Classification
PERSPECTRUM (Chen et al., 2019) | Data | Classification
UD English (Nivre et al., 2016) | Data | Parsing


Citation

	@article{gardner2020evaluating,
	  title={Evaluating NLP Models via Contrast Sets},
	  author={Gardner, Matt and Artzi, Yoav and Basmova, Victoria and Berant, Jonathan and Bogin, Ben and Chen, Sihao
	    and Dasigi, Pradeep and Dua, Dheeru and Elazar, Yanai and Gottumukkala, Ananth and Gupta, Nitish
	    and Hajishirzi, Hanna and Ilharco, Gabriel and Khashabi, Daniel and Lin, Kevin and Liu, Jiangming
	    and Liu, Nelson F. and Mulcaire, Phoebe and Ning, Qiang and Singh, Sameer and Smith, Noah A.
	    and Subramanian, Sanjay and Tsarfaty, Reut and Wallace, Eric and Zhang, Ally and Zhou, Ben},
	  journal={arXiv preprint},
	  year={2020}
	}