Field-aware Calibration: A Simple and Empirically Strong Method for Reliable Probabilistic Predictions

It is often observed that the probabilistic predictions produced by a machine learning model disagree with the averaged actual outcomes on specific subsets of the data, an issue known as miscalibration. Miscalibration undermines the reliability of practical machine learning systems. For example, in online advertising, an ad can receive a click-through rate prediction of 0.1 over some population of users although its actual click rate is 0.15. In such cases, the probabilistic predictions have to be calibrated before the system can be deployed. In this paper, we first introduce a new evaluation metric, the field-level calibration error, which measures the bias of predictions over a sensitive input field that the decision-maker cares about. We show that existing post-hoc calibration methods yield limited improvement on this new field-level metric and on non-calibration metrics such as the AUC score. We therefore propose Neural Calibration, a simple yet powerful post-hoc calibration method that learns to calibrate by making full use of the field-aware information in the validation set. We present extensive experiments on five large-scale datasets. The results show that Neural Calibration significantly improves over uncalibrated predictions on common metrics such as the negative log-likelihood, the Brier score, and AUC, as well as on the proposed field-level calibration error.
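
The field-level calibration error described above can be made concrete with a short sketch. The Python code below is a minimal illustration, not the paper's reference implementation: the function name, signature, and size-weighted averaging over field values are assumptions. For each value of a chosen input field, it measures the absolute gap between the average predicted probability and the average observed outcome, then averages these gaps weighted by group size.

```python
# Minimal sketch (assumed formulation): field-level calibration error as the
# size-weighted average, over field values z, of the absolute bias between
# mean predicted probability and mean observed outcome on the subset field == z.
import numpy as np

def field_level_calibration_error(y_true, y_prob, field_values):
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    field_values = np.asarray(field_values)

    error, n_total = 0.0, len(y_true)
    for z in np.unique(field_values):
        mask = field_values == z
        bias = abs(y_prob[mask].mean() - y_true[mask].mean())
        error += (mask.sum() / n_total) * bias  # weight each group by its size
    return error

# Toy usage mirroring the abstract's example: an ad predicted at 0.10
# whose observed click rate over 20 impressions is 0.15.
y_true = np.array([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
y_prob = np.full(20, 0.10)
fields = np.array(["ad_1"] * 20)
print(field_level_calibration_error(y_true, y_prob, fields))  # ~0.05
```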

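Similarly, the following PyTorch sketch shows one way a field-aware post-hoc calibrator in the spirit of Neural Calibration could be structured: a learned transformation of the uncalibrated logit plus an auxiliary network over the raw input fields, passed through a sigmoid and fitted on a held-out validation set with log loss. The class name, layer sizes, and the exact form of the logit transformation are illustrative assumptions rather than the paper's precise architecture.

```python
# Hedged sketch of a field-aware post-hoc calibrator fitted on a validation set.
import torch
import torch.nn as nn

class FieldAwareCalibrator(nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # Learned 1-D transformation of the original logit (a small MLP here;
        # the paper's exact parameterization may differ).
        self.logit_transform = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
        # Auxiliary network over the raw input fields.
        self.field_net = nn.Sequential(nn.Linear(num_features, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, logit, x):
        # Calibrated probability = sigmoid(transformed logit + field-aware correction).
        return torch.sigmoid(self.logit_transform(logit) + self.field_net(x))

def fit_calibrator(model, logits_val, x_val, y_val, epochs=100, lr=1e-3):
    """Fit on validation data: uncalibrated logits (N, 1), raw features (N, F),
    and float labels (N,)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.BCELoss()
    for _ in range(epochs):
        opt.zero_grad()
        p = model(logits_val, x_val).squeeze(-1)
        loss = loss_fn(p, y_val)
        loss.backward()
        opt.step()
    return model
```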