We build four new test sets for the
Stanford Question Answering Dataset (SQuAD)
and evaluate the ability of question-answering systems to generalize to new
data. In the original Wikipedia domain, we find no evidence of adaptive
overfitting despite several years of test set re-use. On datasets derived
from New York Times articles, Reddit posts, and Amazon product reviews, we
observe average performance drops of 3.8, 14.0, and 17.4 F1 points, respectively,
across a broad range of models. In contrast, a strong human baseline matches
or exceeds the performance of SQuAD models on the original domain and
exhibits little to no drop in new domains. Taken together, our results
confirm the surprising resilience of the holdout method and emphasize the
need to move towards evaluation metrics that incorporate robustness to
natural distribution shifts.