Verified Uncertainty Calibration

Applications such as weather forecasting and personalized medicine demand models that output calibrated probability estimates, that is, estimates representative of the true likelihood of a prediction. Most models are not calibrated out of the box but are recalibrated by post-processing their outputs. We find in this work that (i) popular recalibration methods like Platt scaling and temperature scaling are less calibrated than reported, and (ii) current techniques cannot estimate how miscalibrated they are. An alternative method, histogram binning, has measurable calibration error but is sample inefficient: it requires $O(B/\epsilon^2)$ samples, compared to $O(1/\epsilon^2)$ for scaling methods, where $B$ is the number of distinct probabilities the model can output. To get the best of both worlds, we introduce the scaling-binning calibrator, which first fits a parametric function to reduce variance and then bins the function values to actually ensure calibration. This requires only $O(1/\epsilon^2 + B)$ samples. Next, we show that we can estimate a model's calibration error more accurately using an estimator from the meteorological community, or, equivalently, measure it with fewer samples ($O(\sqrt{B})$ instead of $O(B)$). We validate our approach with multiclass calibration experiments on CIFAR-10 and ImageNet, where we obtain a 35% lower calibration error than histogram binning and, unlike scaling methods, guarantees on true calibration. In these experiments, we also estimate the calibration error and expected calibration error (ECE) more accurately than the commonly used plugin estimators. We implement all these methods in a Python library: this https URL
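
The scaling-binning procedure described above maps onto three steps over three data splits: fit a parametric scaling function, choose uniform-mass bins from the scaled scores, and output each bin's mean scaled score. Below is a minimal sketch for the binary case, not the library's actual API; the function name fit_scaling_binning, the use of scikit-learn's LogisticRegression for the Platt-scaling step, and the empty-bin fallback are all our assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_scaling_binning(scores, labels, n_bins=10, seed=0):
    """Sketch of a scaling-binning calibrator for binary classification.

    scores: 1D array of uncalibrated model scores (e.g., logits)
    labels: 1D array of binary outcomes in {0, 1}
    Returns a function mapping new scores to calibrated probabilities.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    s1, s2, s3 = np.array_split(rng.permutation(len(labels)), 3)

    # Step 1: fit a parametric scaling function on split 1
    # (Platt scaling: logistic regression on the raw score).
    platt = LogisticRegression().fit(scores[s1].reshape(-1, 1), labels[s1])

    def g(z):
        """Scaled scores from the fitted Platt model."""
        return platt.predict_proba(np.asarray(z, dtype=float).reshape(-1, 1))[:, 1]

    # Step 2: uniform-mass bins -- internal edges are quantiles of the
    # scaled scores on split 2, so each bin holds roughly equally many points.
    edges = np.quantile(g(scores[s2]), np.linspace(0, 1, n_bins + 1)[1:-1])

    # Step 3: each bin outputs the mean scaled score on split 3; this
    # discretization is what makes the output's calibration measurable.
    scores3 = g(scores[s3])
    assignments = np.digitize(scores3, edges)  # bin indices in 0..n_bins-1
    full = np.concatenate(([0.0], edges, [1.0]))
    mids = (full[:-1] + full[1:]) / 2  # fallback value for empty bins
    bin_means = np.array([
        scores3[assignments == b].mean() if (assignments == b).any() else mids[b]
        for b in range(n_bins)
    ])

    def calibrate(z):
        return bin_means[np.digitize(g(z), edges)]
    return calibrate
```

A typical use would be fitting on a held-out validation set, calibrate = fit_scaling_binning(val_scores, val_labels), then applying calibrate(test_scores) at prediction time; the parametric step keeps the variance low and the final binning step yields the $O(1/\epsilon^2 + B)$ sample complexity quoted above.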
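The meteorological estimator referenced above admits an equally short sketch for a model that outputs finitely many distinct probabilities: take the plug-in squared error in each bin and subtract an estimate of its upward bias, in the spirit of the bias-corrected Brier-score decomposition of Ferro (2012). This is a hedged rendition for the binary case; the function name debiased_squared_ce and the clipping at zero are our choices, not the paper's.

```python
import numpy as np

def debiased_squared_ce(probs, labels):
    """Sketch: debiased estimate of the squared (L2) calibration error
    of a binned model.

    probs:  1D array of model outputs taking finitely many distinct values
    labels: 1D array of binary outcomes in {0, 1}
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(labels)
    total = 0.0
    for s in np.unique(probs):
        mask = probs == s
        n_s = int(mask.sum())
        if n_s < 2:
            continue  # the variance correction below needs n_s >= 2
        y_bar = labels[mask].mean()
        # Plug-in term (y_bar - s)^2, minus an estimate of its upward
        # bias, y_bar * (1 - y_bar) / (n_s - 1).
        total += (n_s / n) * ((y_bar - s) ** 2 - y_bar * (1 - y_bar) / (n_s - 1))
    return max(total, 0.0)  # the correction can overshoot below zero
```

The commonly used plug-in estimator corresponds to dropping the subtracted term; the correction removes a per-bin bias of order $1/n_s$ that makes models look more miscalibrated than they are, which underlies the $O(\sqrt{B})$ versus $O(B)$ sample-complexity gap quoted in the abstract.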
