Identifying and Benchmarking Natural Out-of-Context Prediction Problems

Deep learning systems frequently fail at out-of-context (OOC) prediction: the problem of making reliable predictions on uncommon or unusual inputs or subgroups of the training distribution. To measure progress on this problem, a number of benchmarks for OOC performance have recently been introduced. In this work, we introduce a framework unifying the literature on OOC performance measurement, and demonstrate how rich auxiliary information can be leveraged to identify candidate sets of OOC examples in existing datasets. We present NOOCH: a suite of naturally-occurring “challenge sets”, and show how varying notions of context can be used to probe specific OOC failure modes. Experimentally, we explore the tradeoffs between various learning approaches on these challenge sets and demonstrate how the choices made in designing OOC benchmarks can yield varying conclusions.
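To make the idea of leveraging auxiliary information concrete, here is a minimal sketch (our own illustration, not the paper's actual NOOCH construction; the helper names and the rarity heuristic are assumptions) in which each (object, context) pair is scored by how often it co-occurs in auxiliary training annotations, such as scene or stuff labels in a dataset like COCO-Stuff, and rare pairings are flagged as candidate OOC examples:

```python
# Minimal sketch: flag candidate OOC examples as rare (object, context)
# co-occurrences in auxiliary training annotations. Hypothetical helper
# names; not the actual NOOCH construction.
from collections import Counter

def cooccurrence_counts(pairs):
    """Count occurrences of each (object_label, context_label) pair.

    `pairs` yields tuples like ("surfboard", "street"), assumed to come
    from auxiliary annotations (e.g., COCO-Stuff scene/stuff labels).
    """
    return Counter(pairs)

def flag_ooc_candidates(pairs, counts, min_count=5):
    """Return the pairings whose training co-occurrence falls below a threshold."""
    return [p for p in pairs if counts[p] < min_count]

if __name__ == "__main__":
    train = [("dog", "park")] * 100 + [("dog", "kitchen")] * 2
    counts = cooccurrence_counts(train)
    # Flags ("dog", "kitchen") as a candidate OOC pairing.
    print(flag_ooc_candidates(sorted(set(train)), counts))
```

Note that the choice of context vocabulary and rarity threshold in such a procedure encodes one particular notion of "context"; varying that notion is precisely what allows different OOC failure modes to be probed.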
