What Are Bayesian Neural Network Posteriors Really Like?

The posterior over Bayesian neural network (BNN) parameters is extremely high-dimensional and non-convex. For computational reasons, researchers approximate this posterior using inexpensive mini-batch methods such as mean-field variational inference or stochastic-gradient Markov chain Monte Carlo (SGMCMC). To investigate foundational questions in Bayesian deep learning, we instead use full-batch Hamiltonian Monte Carlo (HMC) on modern architectures. We show that (1) BNNs can achieve significant performance gains over standard training and deep ensembles; (2) a single long HMC chain can provide a representation of the posterior comparable to that of multiple shorter chains; (3) in contrast to recent studies, we find that posterior tempering is not needed for near-optimal performance and see little evidence for a “cold posterior” effect, which we show is largely an artifact of data augmentation; (4) Bayesian model average (BMA) performance is robust to the choice of prior scale, and is relatively similar for diagonal Gaussian, mixture-of-Gaussian, and logistic priors; (5) Bayesian neural networks show surprisingly poor generalization under domain shift; (6) while cheaper alternatives such as deep ensembles and SGMCMC can provide good generalization, their predictive distributions are distinct from that of HMC. Notably, the predictive distributions of deep ensembles are about as close to the HMC predictive distribution as those of standard stochastic gradient Langevin dynamics (SGLD), and closer than those of standard variational inference.
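
For concreteness, the Bayesian model average and posterior tempering referred to above can be written in standard form as follows (a notational sketch only; the specific priors, temperature values, and HMC settings are those studied in the paper and are not repeated here, and M denotes the number of posterior samples):

\[
p(y \mid x, \mathcal{D}) \;=\; \int p(y \mid x, \theta)\, p(\theta \mid \mathcal{D})\, d\theta \;\approx\; \frac{1}{M} \sum_{m=1}^{M} p(y \mid x, \theta_m), \qquad \theta_m \sim p(\theta \mid \mathcal{D}),
\]
\[
p_T(\theta \mid \mathcal{D}) \;\propto\; \big( p(\mathcal{D} \mid \theta)\, p(\theta) \big)^{1/T}, \qquad T = 1 \text{ (untempered Bayes posterior)}, \quad T < 1 \text{ (“cold” posterior)}.
\]

The “cold posterior” effect refers to the observation that sampling at T < 1 sometimes improves predictive performance over the T = 1 posterior; the results above indicate that, once data augmentation is accounted for, there is little evidence for this effect.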
