Localized Calibration: Metrics and Recalibration

Probabilistic classifiers output confidence scores along with their predictions, and these scores must be well calibrated (i.e., reflect the true probability of an event) to be meaningful and useful for downstream tasks. However, existing calibration metrics are insufficient. Commonly used metrics such as the expected calibration error (ECE) measure only global trends, making them ineffective for assessing the calibration of a particular sample or subgroup. At the other end of the spectrum, a fully individualized calibration error is in general intractable to estimate from finite samples. In this work, we propose the local calibration error (LCE), a fine-grained calibration metric that spans the gap between fully global and fully individualized calibration. The LCE leverages learned features to automatically capture rich subgroups, and it measures the calibration error around each individual example via a similarity function. We then introduce a localized recalibration method, LoRe, that reduces the LCE more than existing recalibration methods do. Finally, we show that applying our recalibration method improves decision-making on downstream tasks.
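To make the idea of measuring calibration "around each individual example via a similarity function" concrete, the following is a minimal sketch and not the paper's exact definition of the LCE. It assumes a Gaussian kernel over feature vectors as the similarity function and compares kernel-weighted confidence with kernel-weighted accuracy near one query example. The function name `local_calibration_gap`, the `bandwidth` parameter, and the toy data are all hypothetical, and any confidence binning the actual metric may use is omitted.

```python
import numpy as np

def local_calibration_gap(features, confidences, correct, query_idx, bandwidth=1.0):
    """Kernel-weighted |confidence - accuracy| around one example.

    Illustrative assumption: a Gaussian kernel on feature vectors stands in
    for the similarity function; this is not the paper's formula.
    """
    # Squared Euclidean distance from the query example to every example.
    diffs = features - features[query_idx]
    sq_dist = np.sum(diffs ** 2, axis=1)
    # Similarity weights, normalized to sum to one.
    weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
    weights = weights / weights.sum()
    # Weighted average confidence versus weighted average accuracy.
    local_conf = np.sum(weights * confidences)
    local_acc = np.sum(weights * correct)
    return abs(local_conf - local_acc)

# Toy data (hypothetical): two well-separated subgroups in feature space.
# Both receive confidence 0.9, but only the first is actually 90% accurate.
features = np.concatenate([np.zeros((10, 1)), np.full((10, 1), 100.0)])
confidences = np.full(20, 0.9)
correct = np.array([1.0] * 9 + [0.0] + [1.0] * 5 + [0.0] * 5)

gap_first_group = local_calibration_gap(features, confidences, correct, query_idx=0)
gap_second_group = local_calibration_gap(features, confidences, correct, query_idx=10)
global_gap = abs(confidences.mean() - correct.mean())
```

In this toy setup a single global gap averages over both subgroups, while the localized quantity separates the well-calibrated subgroup from the overconfident one, which is the distinction between global and local metrics drawn above.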
