Measuring Calibration in Deep Learning

Overconfidence and underconfidence in machine learning classifiers are measured by calibration: the degree to which the probabilities predicted for each class match the accuracy of the classifier on those predictions. How best to measure calibration remains an open challenge: expected calibration error, the most popular metric, has numerous flaws, which we outline, and there is no clear empirical understanding of how its design choices affect conclusions in practice or what recommendations exist to counteract its flaws. In this paper, we perform a comprehensive empirical study of choices in calibration measures, including measuring all probabilities rather than just the maximum prediction, thresholding probability values, class conditionality, the number of bins, bins that adapt to the datapoint density, and the norm used to compare accuracies to confidences. To analyze the sensitivity of calibration measures, we study the impact of optimizing directly for each variant with recalibration techniques. Across MNIST, Fashion-MNIST, CIFAR-10/100, and ImageNet, we find that conclusions about the rank ordering of recalibration methods are drastically impacted by the choice of calibration measure. We find that conditioning on the class leads to more effective calibration evaluations, and that using the L2 norm rather than the L1 norm improves both optimization for calibration metrics and the rank correlation measuring metric consistency. Adaptive binning schemes make the rank ordering of metrics more stable as the number of bins varies, and are also recommended. We open-source a library implementing our calibration measures.
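
As a rough illustration of the binning choices discussed above, the sketch below (hypothetical helper name, not the released library's API) computes a binned calibration error over the maximum predictions, with flags for equal-width versus adaptive (equal-mass) bins and for the L1 versus L2 norm.

```python
import numpy as np

def binned_calibration_error(probs, labels, num_bins=15, adaptive=False, norm="l1"):
    """Binned calibration error over the maximum (top-label) predictions.

    probs:    (N, K) array of predicted class probabilities.
    labels:   (N,) array of integer class labels.
    adaptive: if True, use equal-mass bins (equal number of points per bin)
              instead of equal-width bins on [0, 1].
    norm:     "l1" for the usual ECE-style weighted average gap,
              "l2" for the square root of the weighted squared gap.
    """
    confidences = probs.max(axis=1)        # confidence of the top prediction
    predictions = probs.argmax(axis=1)     # predicted class
    correct = (predictions == labels).astype(float)

    if adaptive:
        # Equal-mass bins: edges at quantiles of the confidence distribution.
        edges = np.quantile(confidences, np.linspace(0.0, 1.0, num_bins + 1))
        edges[0] = 0.0  # ensure the lowest-confidence points fall in the first bin
    else:
        # Equal-width bins on [0, 1].
        edges = np.linspace(0.0, 1.0, num_bins + 1)

    error, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        weight = in_bin.sum() / n
        error += weight * (gap ** 2 if norm == "l2" else gap)

    return np.sqrt(error) if norm == "l2" else error
```

Sweeping `num_bins`, `adaptive`, and `norm` in a helper like this mirrors three of the axes of variation studied in the paper; class-conditional and all-probability variants would additionally loop over classes or over every predicted probability rather than only the maximum.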
