Amortized Conditional Normalized Maximum Likelihood: Reliable Out of Distribution Uncertainty Estimation

While deep neural networks provide good performance for a range of challenging tasks, calibration and uncertainty estimation remain major challenges, especially under distribution shift. In this paper, we propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable, general-purpose approach for uncertainty estimation, calibration, and out-of-distribution robustness with deep networks. Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle, but is computationally intractable to evaluate exactly for all but the simplest model classes. We propose to use approximate Bayesian inference techniques to produce a tractable approximation to the CNML distribution. Our approach can be combined with any approximate inference algorithm that provides tractable posterior densities over model parameters. We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.
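For concreteness, here is a sketch of the CNML construction the abstract builds on, written in the standard pNML/CNML notation; the symbols below follow the usual formulation and are our own shorthand rather than quotations from this abstract. Given a training set $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^n$ and a query input $x$, CNML assigns

$$p_{\mathrm{CNML}}(y \mid x) \;=\; \frac{p_{\hat\theta_y}(y \mid x)}{\sum_{y'} p_{\hat\theta_{y'}}(y' \mid x)}, \qquad \hat\theta_y \;=\; \arg\max_\theta \Big[ \log p_\theta(y \mid x) \;+\; \sum_{i=1}^n \log p_\theta(y_i \mid x_i) \Big],$$

i.e., each candidate label is scored by a model fit to the training data augmented with that label, and the scores are normalized over labels. This is intractable for deep networks because it requires retraining the model for every candidate label of every query. ACNML amortizes the training-set term by replacing it with the log-density of an approximate posterior $q(\theta)$ computed once up front:

$$\hat\theta_y \;\approx\; \arg\max_\theta \big[ \log p_\theta(y \mid x) + \log q(\theta) \big],$$

leaving only a short per-label optimization at test time.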

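The test-time procedure can be made concrete with a Gaussian approximate posterior (e.g., from a Laplace approximation or SWAG). The following is a minimal sketch under that assumption; all function and variable names here are illustrative, not an API from the paper:

```python
import numpy as np

def acnml_predict(log_lik_fn, grad_fn, theta_map, precision, x, labels,
                  steps=20, lr=1e-2):
    """Sketch of ACNML test-time inference (names are illustrative).

    log_lik_fn(theta, x, y) -> log p_theta(y | x) for the query point.
    grad_fn(theta, x, y)    -> its gradient with respect to theta.
    theta_map, precision    -> a Gaussian approximate posterior
                               q(theta) = N(theta_map, precision^{-1}).
    """
    scores = []
    for y in labels:
        theta = theta_map.copy()
        # Maximize log p_theta(y | x) + log q(theta) by gradient ascent;
        # for a Gaussian q, grad log q(theta) = -precision @ (theta - mean),
        # and this term stands in for the full training log-likelihood.
        for _ in range(steps):
            grad = grad_fn(theta, x, y) - precision @ (theta - theta_map)
            theta += lr * grad
        scores.append(log_lik_fn(theta, x, y))
    # Normalize over candidate labels to get the (approximate) CNML
    # distribution; subtract the max for numerical stability.
    scores = np.array(scores)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()
```

The key design point is that the expensive part (fitting the posterior over parameters) happens once during training, so each query only pays for a few gradient steps per candidate label.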