We build four new test sets for the Stanford Question Answering Dataset (SQuAD)
            and evaluate the ability of question-answering systems to generalize to new
            data. In the original Wikipedia domain, we find no evidence of adaptive
            overfitting despite several years of test set re-use. On datasets derived
            from New York Times articles, Reddit posts, and Amazon product reviews, we
            observe average performance drops of 3.8, 14.0, and 17.4 F1, respectively,
            across a broad range of models. In contrast, a strong human baseline matches
            or exceeds the performance of SQuAD models on the original domain and
            exhibits little to no drop in new domains. Taken together, our results
            confirm the surprising resilience of the holdout method and emphasize the
            need to move towards evaluation metrics that incorporate robustness to
            natural distribution shifts.