Layer-Stack Temperature Scaling

Recent work demonstrates that early layers of a neural network contain information useful for prediction. Inspired by this, we show that extending temperature scaling across all layers improves both calibration and accuracy. We call this procedure “layer-stack temperature scaling” (LATES). Informally, LATES grants each layer a weighted vote during inference. We evaluate it on five popular convolutional neural network architectures, both in- and out-of-distribution, and observe a consistent improvement over temperature scaling in accuracy, calibration, and AUC. All conclusions are supported by comprehensive statistical analyses. Since LATES neither retrains the architecture nor introduces many additional parameters, its advantages come without requiring any data beyond what temperature scaling already uses. Finally, we show that combining LATES with Monte Carlo Dropout matches state-of-the-art results on CIFAR-10/100.
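
Below is a minimal sketch of the idea described above: each layer produces its own class scores, every layer gets its own temperature, and the temperature-scaled predictions are combined by a learned weighted vote fitted on held-out data, as in temperature scaling. The class and function names, the per-layer logits input, and the NLL fitting objective are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LayerStackTemperatureScaler(nn.Module):
    """Combine per-layer logits via per-layer temperatures and mixing weights."""

    def __init__(self, num_layers: int):
        super().__init__()
        # One temperature per layer (log-parameterised; init 1.0 = no scaling).
        self.log_temps = nn.Parameter(torch.zeros(num_layers))
        # One mixing weight per layer; softmax keeps them on the simplex.
        self.weight_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, per_layer_logits: torch.Tensor) -> torch.Tensor:
        # per_layer_logits: (num_layers, batch, num_classes)
        temps = self.log_temps.exp().view(-1, 1, 1)
        probs = F.softmax(per_layer_logits / temps, dim=-1)
        weights = F.softmax(self.weight_logits, dim=0).view(-1, 1, 1)
        # Weighted vote over layers -> (batch, num_classes) probabilities.
        return (weights * probs).sum(dim=0)


def fit_on_validation(scaler, per_layer_logits, labels, max_iter=200):
    """Fit temperatures and weights by NLL on held-out data (as in temperature scaling)."""
    opt = torch.optim.LBFGS(scaler.parameters(), lr=0.05, max_iter=max_iter)

    def closure():
        opt.zero_grad()
        probs = scaler(per_layer_logits)
        loss = F.nll_loss(torch.log(probs + 1e-12), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return scaler


# Usage sketch with random stand-in data; in practice the per-layer logits
# would come from small heads (e.g. linear probes on pooled intermediate
# features, an assumption) evaluated on the validation split.
num_layers, batch, num_classes = 5, 128, 10
per_layer_logits = torch.randn(num_layers, batch, num_classes)
labels = torch.randint(0, num_classes, (batch,))
scaler = fit_on_validation(LayerStackTemperatureScaler(num_layers),
                           per_layer_logits, labels)
```

Since only the temperatures and mixing weights are optimised, the backbone stays frozen and the extra parameter count is on the order of twice the number of layers, consistent with the claim that LATES adds little beyond standard temperature scaling.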
