How Good is the Bayes Posterior in Deep Neural Networks Really?

During the past five years the Bayesian deep learning community has developed increasingly accurate and efficient approximate inference procedures that allow for Bayesian inference in deep neural networks. However, despite this algorithmic progress and the promise of improved uncertainty quantification and sample efficiency, there are, as of early 2020, no publicized deployments of Bayesian neural networks in industrial practice. In this work we cast doubt on the current understanding of Bayes posteriors in popular deep neural networks: we demonstrate through careful MCMC sampling that the posterior predictive induced by the Bayes posterior yields systematically worse predictions than simpler methods, including point estimates obtained from SGD. Furthermore, we demonstrate that predictive performance is improved significantly through the use of a "cold posterior" that overcounts evidence. Such cold posteriors sharply deviate from the Bayesian paradigm but are commonly used as a heuristic in Bayesian deep learning papers. We put forward several hypotheses that could explain cold posteriors and evaluate them through experiments. Our work questions the goal of accurate posterior approximations in Bayesian deep learning: if the true Bayes posterior is poor, what is the use of more accurate approximations? Instead, we argue that it is timely to focus on understanding the origin of the improved performance of cold posteriors.
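For context, the tempered ("cold") posterior referred to above is usually written as follows; this is the standard formulation in the Bayesian deep learning literature, a minimal sketch with the symbols U(theta) and T introduced here for illustration rather than quoted from the paper:

\[
  p_T(\theta \mid D) \;\propto\; \exp\!\big(-U(\theta)/T\big),
  \qquad
  U(\theta) \;=\; -\sum_{i=1}^{n} \log p(y_i \mid x_i, \theta) \;-\; \log p(\theta).
\]

Setting T = 1 recovers the usual Bayes posterior, while T < 1 sharpens the distribution: the likelihood (and prior) terms are effectively raised to the power 1/T > 1, which is analogous to replicating the observed data and is why such posteriors are said to overcount the evidence.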
