Bayesian Neural Network Priors Revisited

Isotropic Gaussian priors are the de facto standard for modern Bayesian neural network inference. However, it is unclear whether these priors accurately reflect our true beliefs about the weight distributions, or whether they lead to optimal performance. To find better priors, we study summary statistics of the weights of neural networks trained with SGD. We find that convolutional neural network (CNN) weights display strong spatial correlations, while fully connected networks (FCNNs) display heavy-tailed weight distributions. Building these observations into priors leads to improved performance on a variety of image classification datasets. Surprisingly, these priors mitigate the cold posterior effect in FCNNs, but slightly increase the cold posterior effect in ResNets.
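As a concrete illustration of the kind of summary statistics involved, the sketch below fits Gaussian and Student-t marginals to fully connected weights and estimates the empirical spatial correlation of convolutional filters. This is a minimal sketch, not the paper's code: `model` stands for any trained PyTorch network, and the function names are our own.

```python
# Illustrative sketch (assumed setup, not the paper's implementation):
# compare Gaussian vs. heavy-tailed Student-t fits to fully connected
# weights, and estimate spatial correlations within conv filters.
import numpy as np
import torch.nn as nn
from scipy import stats


def fit_weight_marginals(model: nn.Module) -> None:
    """Fit Gaussian and Student-t marginals to each Linear layer's weights."""
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            w = module.weight.detach().cpu().numpy().ravel()
            # Maximum-likelihood fits; a lower negative log-likelihood
            # indicates a better-fitting marginal distribution.
            mu, sigma = stats.norm.fit(w)
            df, loc, scale = stats.t.fit(w)
            nll_norm = -stats.norm.logpdf(w, mu, sigma).sum()
            nll_t = -stats.t.logpdf(w, df, loc, scale).sum()
            print(f"{name}: Gaussian NLL={nll_norm:.0f}, "
                  f"Student-t NLL={nll_t:.0f} (df={df:.2f})")


def spatial_correlation(conv: nn.Conv2d) -> np.ndarray:
    """Empirical correlation matrix across the k*k spatial positions,
    pooled over all (out_channel, in_channel) filter slices."""
    w = conv.weight.detach().cpu().numpy()           # (out, in, k, k)
    flat = w.reshape(-1, w.shape[-2] * w.shape[-1])  # rows = filter slices
    return np.corrcoef(flat, rowvar=False)           # (k*k, k*k)
```

A markedly lower Student-t negative log-likelihood with small fitted degrees of freedom would indicate heavy-tailed FCNN weights, while large off-diagonal entries in the conv correlation matrix would indicate spatial correlation; the corresponding priors are then a Student-t (or similar heavy-tailed) prior for FCNN weights and a correlated multivariate Gaussian over each filter's spatial positions.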
