On Calibration and Out-of-domain Generalization

Out-of-domain (OOD) generalization is a significant challenge for machine learning models. Many techniques have been proposed to overcome this challenge, often focused on learning models with certain invariance properties. In this work, we draw a link between OOD performance and model calibration, arguing that calibration across multiple domains can be viewed as a special case of an invariant representation leading to better OOD generalization. Specifically, we show that under certain conditions, models which achieve multi-domain calibration are provably free of spurious correlations. This leads us to propose multi-domain calibration as a measurable and trainable surrogate for the OOD performance of a classifier. We therefore introduce methods that are easy to apply and allow practitioners to improve multi-domain calibration by training or modifying an existing model, leading to better performance on unseen domains. Using four datasets from the recently proposed WILDS OOD benchmark [23], as well as the Colored MNIST dataset [21], we demonstrate that training or tuning models so they are calibrated across multiple domains leads to significantly improved performance on unseen test domains. We believe this intriguing connection between calibration and OOD generalization is promising from both a practical and theoretical point of view.
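To make "multi-domain calibration as a measurable surrogate" concrete, the following is a minimal sketch of measuring calibration separately on each training domain. It assumes the standard binned expected calibration error (ECE) as the calibration measure. The abstract does not specify the measure or the training procedure the authors use, so the function names `expected_calibration_error` and `per_domain_ece` and the choice of 10 bins are illustrative assumptions, not the paper's method.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-weighted mean of |accuracy - mean confidence|.

    Assumption: equal-width bins on [0, 1]; this is one common
    calibration measure, not necessarily the one used in the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1]), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in bin
    return ece

def per_domain_ece(confidences, correct, domains, n_bins=10):
    """Return {domain: ECE}, computed on each domain's samples alone."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    domains = np.asarray(domains)
    return {
        d.item(): expected_calibration_error(
            confidences[domains == d], correct[domains == d], n_bins
        )
        for d in np.unique(domains)
    }

# Toy data (hypothetical): the model reports 0.9 confidence everywhere,
# but is right 90% of the time in domain "a" and 50% in domain "b".
conf = [0.9] * 20
hit = [1] * 9 + [0] + [1] * 5 + [0] * 5
dom = ["a"] * 10 + ["b"] * 10
scores = per_domain_ece(conf, hit, dom)
```

In this toy case the model is well calibrated on domain "a" and badly calibrated on domain "b". A large gap between domains like this is the kind of signal the abstract suggests tracking, under the assumptions stated above.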

[1] Sunita Sarawagi, et al. Trainable Calibration Measures For Neural Networks From Kernel Mean Embeddings, 2018, ICML.

[2] Anne Driscoll, et al. Using publicly available satellite imagery and deep learning to understand economic well-being in Africa, 2020, Nature Communications.

[3] Jon M. Kleinberg, et al. On Fairness and Calibration, 2017, NIPS.

[4] Joris M. Mooij, et al. Domain Adaptation by Using Causal Inference to Predict Invariant Conditional Distributions, 2017, NeurIPS.

[5] Shaoqun Zeng, et al. From Detection of Individual Metastases to Classification of Lymph Node Status at the Patient Level: The CAMELYON17 Challenge, 2019, IEEE Transactions on Medical Imaging.

[6] Marcus A. Badgeley, et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study, 2018, PLoS Medicine.

[7] Nathan Srebro, et al. Does Invariant Risk Minimization Capture Invariance?, 2021, ArXiv.

[8] Ming-Wei Chang, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019, NAACL.

[9] Jianmo Ni, et al. Justifying Recommendations using Distantly-Labeled Reviews and Fine-Grained Aspects, 2019, EMNLP.

[10] Leo Breiman, et al. Random Forests, 2001, Machine Learning.

[11] Philip H.S. Torr, et al. Calibrating Deep Neural Networks using Focal Loss, 2020, NeurIPS.

[12] N. Meinshausen, et al. Anchor regression: Heterogeneous data meet causality, 2018, Journal of the Royal Statistical Society: Series B (Statistical Methodology).

[13] P. Cochat, et al. Et al, 2008, Archives de pediatrie : organe officiel de la Societe francaise de pediatrie.

[14] Aaron C. Courville, et al. Out-of-Distribution Generalization via Risk Extrapolation (REx), 2020, ICML.

[15] Gang Niu, et al. Does Distributionally Robust Supervised Learning Give Robust Classifiers?, 2016, ICML.

[16] Yann LeCun, et al. The MNIST database of handwritten digits, 2005.

[17] Roi Reichart, et al. Predicting In-Game Actions from Interviews of NBA Players, 2019, Computational Linguistics.

[18] Pradeep Ravikumar, et al. The Risks of Invariant Risk Minimization, 2020, ICLR.

[19] Jacob Roll, et al. Evaluating model calibration in classification, 2019, AISTATS.

[20] Stephen E. Fienberg, et al. The Comparison and Evaluation of Forecasters, 1983.

[21] Jian Sun, et al. Identity Mappings in Deep Residual Networks, 2016, ECCV.

[22] G. Brier. Verification of Forecasts Expressed in Terms of Probability, 1950.

[23] Christina Heinze-Deml, et al. Invariant Causal Prediction for Nonlinear Models, 2017, Journal of Causal Inference.

[24] Stefano Ermon, et al. Accurate Uncertainties for Deep Learning Using Calibrated Regression, 2018, ICML.

[25] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.

[26] Ruocheng Guo, et al. Out-of-distribution Prediction with Invariant Risk Minimization: The Limitation and An Effective Fix, 2021, ArXiv.

[27] Kate Saenko, et al. Deep CORAL: Correlation Alignment for Deep Domain Adaptation, 2016, ECCV Workshops.

[28] Lukasz Kaiser, et al. Attention is All you Need, 2017, NIPS.

[29] Gordon Christie, et al. Functional Map of the World, 2017, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.

[30] Sample Complexity of Uniform Convergence for Multicalibration, 2020, NeurIPS.

[31] Yoram Singer, et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, 2011, J. Mach. Learn. Res.

[32] Guy N. Rothblum, et al. Multicalibration: Calibration for the (Computationally-Identifiable) Masses, 2018, ICML.

[33] Rich Caruana, et al. Predicting good probabilities with supervised learning, 2005, ICML.

[34] Stephen P. Boyd, et al. CVXPY: A Python-Embedded Modeling Language for Convex Optimization, 2016, J. Mach. Learn. Res.

[35] David Lopez-Paz, et al. Invariant Risk Minimization, 2019, ArXiv.

[36] Jeremy Nixon, et al. Measuring Calibration in Deep Learning, 2019, CVPR Workshops.

[37] David Lopez-Paz, et al. In Search of Lost Domain Generalization, 2020, ICLR.

[38] P. Alam. ‘A’, 2021, Composites Engineering: An A–Z Guide.

[39] Junmo Kim, et al. Learning Not to Learn: Training Deep Neural Networks With Biased Data, 2018, 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

[40] P. Alam. ‘L’, 2021, Composites Engineering: An A–Z Guide.

[41] Bernhard Schölkopf, et al. On causal and anticausal learning, 2012, ICML.

[42] Demis Hassabis, et al. Improved protein structure prediction using potentials from deep learning, 2020, Nature.

[43] Bohua Zhan, et al. Smooth Manifolds, 2021, Arch. Formal Proofs.

[44] Byron Boots, et al. Intra Order-preserving Functions for Calibration of Multi-Class Neural Networks, 2020, NeurIPS.

[45] Anja De Waegenaere, et al. Robust Solutions of Optimization Problems Affected by Uncertain Probabilities, 2011, Manag. Sci.

[46] Bianca Zadrozny, et al. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers, 2001, ICML.

[47] Judea Pearl, et al. A Probabilistic Calculus of Actions, 1994, UAI.

[48] Lucy Vasserman, et al. Nuanced Metrics for Measuring Unintended Bias with Real Data for Text Classification, 2019, WWW.

[49] Shrey Desai, et al. Calibration of Pre-trained Transformers, 2020, EMNLP.

[50] Percy Liang, et al. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization, 2019, ArXiv.

[51] Kilian Q. Weinberger, et al. Densely Connected Convolutional Networks, 2016, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[52] Suchi Saria, et al. Preventing Failures Due to Dataset Shift: Learning Predictive Models That Transport, 2018, AISTATS.

[53] Kilian Q. Weinberger, et al. On Calibration of Modern Neural Networks, 2017, ICML.

[54] Michael I. Jordan, et al. Transferable Calibration with Lower Bias and Variance in Domain Adaptation, 2020, NeurIPS.

[55] P. Alam. ‘T’, 2021, Composites Engineering: An A–Z Guide.

[56] Jonas Peters, et al. Causal inference by using invariant prediction: identification and confidence intervals, 2015, arXiv:1501.01332.

[57] Vladimir Vovk, et al. Self-calibrating Probability Forecasting, 2003, NIPS.

[58] Aaditya Ramdas, et al. Distribution-free binary classification: prediction sets, confidence intervals and calibration, 2020, NeurIPS.

[59] Mihaela van der Schaar, et al. Generalization and Invariances in the Presence of Unobserved Confounding, 2020, ArXiv.

[60] Illtyd Trethowan. Causality, 1938.

[61] Milos Hauskrecht, et al. Obtaining Well Calibrated Probabilities Using Bayesian Binning, 2015, AAAI.