Calibration with Bias-Corrected Temperature Scaling Improves Domain Adaptation Under Label Shift in Modern Neural Networks

Label shift refers to the phenomenon where the marginal probability p(y) of observing a particular class changes between the training and test distributions while the conditional probability p(x|y) stays fixed. This is relevant in settings such as medical diagnosis, where a classifier trained to predict disease from observed symptoms may need to be adapted to a population in which the baseline frequency of the disease is higher. Given calibrated estimates of p(y|x), one can apply an EM algorithm to correct for the shift in class proportions between the training and test distributions without ever needing to estimate p(x|y). Unfortunately, modern neural networks typically fail to produce well-calibrated probabilities, compromising the effectiveness of this approach. Although Temperature Scaling can greatly reduce miscalibration in these networks, it can leave behind a systematic bias in the probabilities that still poses a problem. To address this, we extend Temperature Scaling with class-specific bias parameters, which largely eliminates systematic bias in the calibrated probabilities and enables effective domain adaptation under label shift. We term our calibration approach "Bias-Corrected Temperature Scaling". In experiments on CIFAR10, we find that EM with Bias-Corrected Temperature Scaling significantly outperforms both EM with Temperature Scaling and the recently proposed Black-Box Shift Estimation.
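
For concreteness, the sketch below (illustrative code written for this summary, not taken from the paper; the function names and fixed parameters are assumptions) shows the two ingredients: calibrated probabilities obtained by applying a scalar temperature T and a class-specific bias vector b to the network's logits, and the EM procedure of Saerens et al. (2002) that re-estimates the test-set class priors from those calibrated probabilities. Fitting T and b on a held-out validation set (e.g., by minimizing negative log-likelihood) is omitted.

import numpy as np

def bias_corrected_temperature_scaling(logits, T, b):
    # Calibrated probabilities: softmax(logits / T + b), where T is a scalar
    # temperature and b is a vector of per-class bias parameters (both assumed
    # to have been fit on held-out validation data; that step is not shown).
    z = logits / T + b
    z -= z.max(axis=1, keepdims=True)  # subtract row-wise max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def em_label_shift(cal_probs, train_priors, n_iter=100, tol=1e-8):
    # Given calibrated source-domain posteriors p(y|x) for unlabeled test inputs
    # and the training-set class priors, alternate between reweighting the
    # posteriors (E-step) and re-estimating the test-set priors (M-step).
    q = train_priors.copy()
    for _ in range(n_iter):
        w = cal_probs * (q / train_priors)   # E-step: reweight by prior ratio
        w /= w.sum(axis=1, keepdims=True)    # renormalize per example
        q_new = w.mean(axis=0)               # M-step: average adapted posterior
        if np.abs(q_new - q).max() < tol:
            q = q_new
            break
        q = q_new
    adapted = cal_probs * (q / train_priors)  # final adapted posteriors
    adapted /= adapted.sum(axis=1, keepdims=True)
    return q, adapted

Under the label-shift assumption, the ratio q(y)/p_train(y) is exactly the importance weight needed to convert training-domain posteriors into test-domain posteriors, which is why any miscalibration in p(y|x) propagates directly into the prior estimates.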
