Calibrating Deep Neural Network Classifiers on Out-of-Distribution Datasets

To increase the trustworthiness of deep neural network (DNN) classifiers, an accurate prediction confidence that represents the true likelihood of correctness is crucial. Towards this end, many post-hoc calibration methods have been proposed to leverage a lightweight model to map the target DNN's output layer into a calibrated confidence. Nonetheless, on an out-of-distribution (OOD) dataset in practice, the target DNN can often mis-classify samples with a high confidence, creating significant challenges for the existing calibration methods to produce an accurate confidence. In this paper, we propose a new post-hoc confidence calibration method, called CCAC (Confidence Calibration with an Auxiliary Class), for DNN classifiers on OOD datasets. The key novelty of CCAC is an auxiliary class in the calibration model which separates mis-classified samples from correctly classified ones, thus effectively mitigating the target DNN's being confidently wrong. We also propose a simplified version of CCAC to reduce free parameters and facilitate transfer to a new unseen dataset. Our experiments on different DNN models, datasets and applications show that CCAC can consistently outperform the prior post-hoc calibration methods.

[1]  Hayder Radha,et al.  Deep learning algorithm for autonomous driving using GoogLeNet , 2017, 2017 IEEE Intelligent Vehicles Symposium (IV).

[2]  Andrew Gordon Wilson,et al.  Averaging Weights Leads to Wider Optima and Better Generalization , 2018, UAI.

[3]  Rick Salay,et al.  Detecting Out-of-Distribution Inputs in Deep Neural Networks Using an Early-Layer Output , 2019, ArXiv.

[4]  Sebastian Nowozin,et al.  Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift , 2019, NeurIPS.

[5]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[6]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[7]  Kibok Lee,et al.  A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks , 2018, NeurIPS.

[8]  Erich Elsen,et al.  Deep Speech: Scaling up end-to-end speech recognition , 2014, ArXiv.

[9]  R. Srikant,et al.  Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks , 2017, ICLR.

[10]  Charles Blundell,et al.  Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles , 2016, NIPS.

[11]  Bianca Zadrozny,et al.  Transforming classifier scores into accurate multiclass probability estimates , 2002, KDD.

[12]  Guigang Zhang,et al.  Deep Learning , 2016, Int. J. Semantic Comput..

[13]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[14]  Suchi Saria,et al.  Can You Trust This Prediction? Auditing Pointwise Reliability After Learning , 2019, AISTATS.

[15]  Mehryar Mohri,et al.  Learning with Rejection , 2016, ALT.

[16]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[17]  Tengyu Ma,et al.  Verified Uncertainty Calibration , 2019, NeurIPS.

[18]  Li Fei-Fei,et al.  ImageNet: A large-scale hierarchical image database , 2009, CVPR.

[19]  Kevin Gimpel,et al.  A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks , 2016, ICLR.

[20]  W. Brendel,et al.  Foolbox: A Python toolbox to benchmark the robustness of machine learning models , 2017 .

[21]  Peter A. Flach,et al.  Beyond temperature scaling: Obtaining well-calibrated multiclass probabilities with Dirichlet calibration , 2019, NeurIPS.

[22]  Bianca Zadrozny,et al.  Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers , 2001, ICML.

[23]  Jelmer M. Wolterink,et al.  Towards increased trustworthiness of deep learning segmentation methods on cardiac MRI , 2018, Medical Imaging: Image Processing.

[24]  Maya R. Gupta,et al.  To Trust Or Not To Trust A Classifier , 2018, NeurIPS.

[25]  Martín Abadi,et al.  TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems , 2016, ArXiv.

[26]  Hal Daumé,et al.  Deep Unordered Composition Rivals Syntactic Methods for Text Classification , 2015, ACL.

[27]  Peter A. Flach,et al.  Beyond sigmoids: How to obtain well-calibrated probabilities from binary classifiers with beta calibration , 2017 .

[28]  Kilian Q. Weinberger,et al.  On Calibration of Modern Neural Networks , 2017, ICML.

[29]  Ran El-Yaniv,et al.  Selective Classification for Deep Neural Networks , 2017, NIPS.

[30]  Jonathon Shlens,et al.  Explaining and Harnessing Adversarial Examples , 2014, ICLR.

[31]  Xiaofeng Liu,et al.  Deep Verifier Networks: Verification of Deep Discriminative Models with Deep Generative Models , 2019, AAAI.

[32]  Mehryar Mohri,et al.  Boosting with Abstention , 2016, NIPS.

[33]  Alex Kendall,et al.  What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? , 2017, NIPS.

[34]  Zoubin Ghahramani,et al.  Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning , 2015, ICML.

[35]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[36]  Matthias Bethge,et al.  Foolbox v0.8.0: A Python toolbox to benchmark the robustness of machine learning models , 2017, ArXiv.

[37]  Mark J. F. Gales,et al.  Predictive Uncertainty Estimation via Prior Networks , 2018, NeurIPS.

[38]  G. Brier VERIFICATION OF FORECASTS EXPRESSED IN TERMS OF PROBABILITY , 1950 .

[39]  Johannes Gehrke,et al.  Intelligible Models for HealthCare: Predicting Pneumonia Risk and Hospital 30-day Readmission , 2015, KDD.

[40]  Milos Hauskrecht,et al.  Obtaining Well Calibrated Probabilities Using Bayesian Binning , 2015, AAAI.