Bayesian Deep Learning and a Probabilistic Perspective of Generalization

The key distinguishing property of a Bayesian approach is marginalization, rather than using a single setting of weights. Bayesian marginalization can particularly improve the accuracy and calibration of modern deep neural networks, which are typically underspecified by the data, and can represent many compelling but different solutions. We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead. We also investigate the prior over functions implied by a vague distribution over neural network weights, explaining the generalization properties of such models from a probabilistic perspective. From this perspective, we explain results that have been presented as mysterious and distinct to neural network generalization, such as the ability to fit images with random labels, and show that these results can be reproduced with Gaussian processes. We also show that Bayesian model averaging alleviates double descent, resulting in monotonic performance improvements with increased flexibility. Finally, we provide a Bayesian perspective on tempering for calibrating predictive distributions.
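
To make the marginalization claim concrete, here is a minimal sketch (not the paper's code) of how a deep ensemble approximates the Bayesian model average p(y|x, D) = ∫ p(y|x, w) p(w|D) dw by treating each independently trained member as a posterior sample w_j and averaging the per-member predictive distributions. The array name `member_probs` and the toy data are illustrative assumptions, not part of the original work.

```python
# Minimal sketch, assuming J independently trained networks whose softmax
# outputs are stacked into a (J, N, C) array `member_probs`
# (J members, N inputs, C classes).

import numpy as np

def ensemble_bma(member_probs: np.ndarray) -> np.ndarray:
    """Average per-member class probabilities: a J-sample Monte Carlo
    estimate of the marginal predictive distribution p(y | x, D)."""
    return member_probs.mean(axis=0)

# Toy usage: 5 members, 3 inputs, 10 classes with random normalized probabilities.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 3, 10))
member_probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

bma_probs = ensemble_bma(member_probs)
print(bma_probs.shape)           # (3, 10)
print(bma_probs.sum(axis=-1))    # each row sums to 1, i.e. a valid distribution
```

Marginalizing within basins of attraction, as proposed in the paper, amounts to drawing several approximate posterior samples around each ensemble solution and averaging their predictions in the same way, rather than using one weight setting per basin.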
