Why Cold Posteriors? On the Suboptimal Generalization of Optimal Bayes Estimates

Recent works have shown that the predictive accuracy of Bayesian deep learning models exhibits substantial improvements when the posterior is raised to the power 1/T with T < 1. In this work, we explore several possible reasons for this surprising behavior.
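The tempering described above can be illustrated with a minimal sketch (not the paper's actual experimental setup): Langevin dynamics targeting p(θ)^(1/T) on a toy one-dimensional posterior, where the injected noise is scaled by the temperature T. Cooling (T < 1) concentrates the samples around the posterior mode.

```python
import numpy as np

def langevin_sample(grad_log_post, theta0, step=0.01,
                    temperature=1.0, n_steps=20000, seed=0):
    """Draw samples from the tempered density p(theta)^(1/T).

    The update theta += (step/2) * grad_log_post(theta)
                       + sqrt(step * T) * noise
    has stationary distribution proportional to exp(log p(theta) / T),
    so T < 1 yields a sharpened ("cold") posterior.
    """
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = []
    for _ in range(n_steps):
        theta = (theta
                 + 0.5 * step * grad_log_post(theta)
                 + np.sqrt(step * temperature) * rng.standard_normal())
        samples.append(theta)
    # Discard the first half as burn-in.
    return np.array(samples[n_steps // 2:])

# Toy posterior: standard normal, log p(theta) = -theta^2 / 2.
grad = lambda t: -t

warm = langevin_sample(grad, 0.0, temperature=1.0)
cold = langevin_sample(grad, 0.0, temperature=0.1)
# The cold chain's sample variance shrinks roughly by the factor T.
print(warm.var(), cold.var())
```

For a Gaussian target the tempered variance is exactly T times the original, which makes the effect easy to verify; in deep networks the same sharpening interacts with the model, prior, and likelihood in much less transparent ways, which is the puzzle the paper examines.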
