We study how robust current ImageNet models are to distribution shifts arising from natural variations in datasets. Most research on robustness focuses on synthetic image perturbations (noise, simulated weather artifacts, adversarial examples, etc.), which leaves open how robustness on synthetic distribution shift relates to distribution shift arising in real data. Informed by an evaluation of 204 ImageNet models in 213 different test conditions, we find that there is often little to no transfer of robustness from current synthetic to natural distribution shift. Moreover, most current techniques provide no robustness to the natural distribution shifts in our testbed. The main exception is training on larger and more diverse datasets, which in multiple cases increases robustness, but is still far from closing the performance gaps. Our results indicate that distribution shifts arising in real data are currently an open research problem.

(Left) An overview of our testbed. Each cell represents a model evaluation on a distribution shift.
(Right) Testing robustness by plotting models in the testbed against a few natural distribution shifts.
Check out our interactive website to play with the data and our paper to read our analysis.


We have released our testbed, which comes with all our code and data, to allow researchers to use our framework to develop and evaluate new models and distribution shifts. To this end, we have made it very simple to add new models with a few lines of code (resnet50 example):

from torchvision.models import resnet50
from registry import registry
from models.model_base import Model, StandardTransform, StandardNormalization

        name = 'resnet50',
        transform = StandardTransform(img_resize_size=256, img_crop_size=224),
        normalization = StandardNormalization(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        classifier_loader = lambda: resnet50(pretrained=True),
        eval_batch_size = 256,
It is similarly easy to add new datasets; check the repository for more information.
Running evaluations is also simple (either with existing models/datasets or ones you've added):

python eval.py --gpus 0 1 --models resnet50 densenet121 --eval-settings val imagenetv2-matched-frequency
Lastly, all our plotting code to visualize results is available. We hope you find these tools useful!


    title={Measuring Robustness to Natural Distribution Shifts in Image Classification},
    author={Rohan Taori and Achal Dave and Vaishaal Shankar and Nicholas Carlini and Benjamin Recht and Ludwig Schmidt},
    booktitle={Advances in Neural Information Processing Systems (NeurIPS)},