Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift

Modern machine learning methods, including deep learning, have achieved great success in predictive accuracy for supervised learning tasks, but may still fall short in giving useful estimates of their predictive uncertainty. Quantifying uncertainty is especially critical in real-world settings, which often involve input distributions that are shifted from the training distribution due to factors such as sample bias and non-stationarity. In such settings, well-calibrated uncertainty estimates convey information about when a model's output should (or should not) be trusted. Many probabilistic deep learning methods, both Bayesian and non-Bayesian, have been proposed for quantifying predictive uncertainty, but to our knowledge there has been no rigorous, large-scale empirical comparison of these methods under dataset shift. We present a large-scale benchmark of existing state-of-the-art methods on classification problems and investigate the effect of dataset shift on accuracy and calibration. We find that traditional post-hoc calibration does indeed fall short, as do several other prior methods. However, some methods that marginalize over models give surprisingly strong results across a broad spectrum of tasks.
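As a concrete illustration of the kind of evaluation described above, the sketch below computes two standard calibration metrics, expected calibration error (ECE) and the Brier score, and applies them to averaged ensemble predictions (one simple way of marginalizing over models). This is a minimal sketch, not the paper's benchmark code: the numpy-array interface, the 15-bin ECE scheme, and the synthetic Dirichlet-distributed predictions are all assumptions made for this example.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """ECE: bin predictions by confidence, then average the
    |accuracy - confidence| gap, weighted by the fraction of
    examples in each bin. (Bin count is an illustrative choice.)"""
    confidences = probs.max(axis=1)
    predictions = probs.argmax(axis=1)
    accuracies = (predictions == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(accuracies[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

def brier_score(probs, labels):
    """Brier score: mean squared error between the predictive
    distribution and the one-hot true label (a proper scoring rule)."""
    one_hot = np.eye(probs.shape[1])[labels]
    return np.mean(np.sum((probs - one_hot) ** 2, axis=1))

# Hypothetical usage with synthetic data standing in for a shifted
# test set: `member_probs` holds softmax outputs from 5 ensemble
# members, shape (n_members, n_examples, n_classes); averaging them
# marginalizes over models.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
member_probs = rng.dirichlet(np.ones(10), size=(5, 1000))
ensemble_probs = member_probs.mean(axis=0)
print("ECE:  ", expected_calibration_error(ensemble_probs, labels))
print("Brier:", brier_score(ensemble_probs, labels))
```

In a benchmark like the one described, these metrics would be recomputed at each level of dataset shift (for example, increasing corruption intensity), so that calibration can be tracked as accuracy degrades rather than measured only on the in-distribution test set.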
