Abstract

We build four new test sets for the Stanford Question Answering Dataset (SQuAD) and evaluate the ability of question-answering systems to generalize to new data. In the original Wikipedia domain, we find no evidence of adaptive overfitting despite several years of test set re-use. On datasets derived from New York Times articles, Reddit posts, and Amazon product reviews, we observe average performance drops of 3.8, 14.0, and 17.4 F1, respectively, across a broad range of models. In contrast, a strong human baseline matches or exceeds the performance of SQuAD models on the original domain and exhibits little to no drop in new domains. Taken together, our results confirm the surprising resilience of the holdout method and emphasize the need to move towards evaluation metrics that incorporate robustness to natural distribution shifts.
Figure: Model and human F1 scores on the original SQuAD v1.1 test set compared to our new test sets for a broad set of more than 100 models. Each point corresponds to a model evaluation, shown with 95% Student's t confidence intervals (mostly covered by the point markers). The plots reveal three main phenomena: (i) there is no evidence of adaptive overfitting on SQuAD, (ii) all of the models suffer F1 drops on the new datasets, with the magnitude of the drop strongly depending on the corpus, and (iii) humans are substantially more robust to natural distribution shifts than the models. The slopes of the linear fits are 0.92, 1.02, 1.19, and 1.36, respectively, i.e., every point of F1 improvement on the original dataset translates into roughly 1 point of improvement (or more) on our new datasets.
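
For reference, a per-model confidence interval of this kind can be computed from the F1 scores of individual test questions. The snippet below is a minimal sketch rather than the exact evaluation code; it assumes the per-example F1 scores are already available and applies a standard Student's t interval for the mean.

# Minimal sketch: 95% Student's t confidence interval for a model's mean F1,
# assuming `per_example_f1` holds one F1 score (in [0, 100]) per test question.
import numpy as np
from scipy import stats

per_example_f1 = np.array([100.0, 83.3, 0.0, 66.7])  # hypothetical values

n = len(per_example_f1)
mean_f1 = per_example_f1.mean()
sem = per_example_f1.std(ddof=1) / np.sqrt(n)
half_width = stats.t.ppf(0.975, df=n - 1) * sem
print(f"F1 = {mean_f1:.1f} +/- {half_width:.1f} (95% CI)")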

Leaderboards

Download Datasets


All the datasets are distributed under the CC BY 4.0 license.
The datasets are also available via huggingface/datasets:
!pip install datasets
from datasets import load_dataset
# Select one of the four new test sets: 'new-wiki', 'nyt', 'reddit', or 'amazon'
dataset = load_dataset('squadshifts', 'reddit')
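
Once loaded, the data can be inspected like any other SQuAD-style dataset. The snippet below is a minimal sketch, assuming the examples follow the standard SQuAD v1.1 schema (id, title, context, question, answers) and are exposed as a single test split.

# Minimal sketch of inspecting the loaded data (assumes a single 'test'
# split with SQuAD v1.1-style fields).
test_set = dataset['test']
print(len(test_set))               # number of question-answer pairs

example = test_set[0]
print(example['question'])         # crowdsourced question
print(example['context'])          # passage the answer is drawn from
print(example['answers']['text'])  # human-annotated answer spans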

Explore Datasets

Paper

Acknowledgements

We thank Pranav Rajpurkar, Robin Jia, and Percy Liang for providing us with the original SQuAD data generation pipeline and answering our many questions about the SQuAD dataset. We thank Nelson Liu for generously providing a large number of the SQuAD models we evaluated, and we thank the CodaLab team for supporting our model evaluation efforts. This research was generously supported in part by the National Science Foundation Graduate Research Fellowship Program under Grant No. DGE 1752814, an Amazon AWS AI Research Award, and a gift from Microsoft Research.