Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect

The “cold posterior effect” (CPE) in Bayesian deep learning describes the disturbing observation that the predictive performance of Bayesian neural networks can be significantly improved if the Bayes posterior is artificially sharpened using a temperature parameter T < 1. The CPE is problematic in theory and in practice, and since the effect was identified, many researchers have proposed hypotheses to explain the phenomenon. However, despite this intensive research effort, the effect remains poorly understood. In this work we provide novel and nuanced evidence relevant to existing explanations for the cold posterior effect, disentangling three hypotheses: 1. The dataset curation hypothesis of Aitchison (2020): we show empirically that the CPE does not arise in a real curated data set but can be produced in a controlled experiment with varying curation strength. 2. The data augmentation hypothesis of Izmailov et al. (2021) and Fortuin et al. (2021): we show empirically that data augmentation is sufficient but not necessary for the CPE to be present. 3. The bad prior hypothesis of Wenzel et al. (2020): we use a simple experiment evaluating the relative importance of the prior and the likelihood, strongly linking the CPE to the prior. Our results demonstrate how the CPE can arise in isolation from synthetic curation, data augmentation, and bad priors. Cold posteriors observed “in the wild” are therefore unlikely to arise from a single simple cause; as a result, we do not expect a simple “fix” for cold posteriors.
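
For readers unfamiliar with posterior tempering, the temperature T enters the posterior in the standard way used by Wenzel et al. (2020) [1]; the formulation below is generic notation, not reproduced from this paper:

    p_T(\theta \mid \mathcal{D}) \;\propto\; \exp\!\left( \frac{\log p(\mathcal{D} \mid \theta) + \log p(\theta)}{T} \right),

so that T = 1 recovers the ordinary Bayes posterior, while T < 1 (“cold”) concentrates the posterior around its modes and T → 0 approaches MAP estimation.

In the SG-MCMC samplers cited below [2, 16, 33], the temperature typically enters by scaling the injected Gaussian noise. The snippet below is a minimal PyTorch-style sketch of a single tempered SGLD update under that assumption; it is illustrative only, not the authors' implementation, and the names log_joint_fn, lr and temperature are placeholders:

    import torch

    def tempered_sgld_step(params, log_joint_fn, lr=1e-6, temperature=1.0):
        """One SGLD update whose stationary distribution is p(theta | D)^(1/temperature).

        params:       list of leaf tensors with requires_grad=True
        log_joint_fn: callable returning log p(D | theta) + log p(theta)
        """
        loss = -log_joint_fn(params)                  # potential energy U(theta)
        grads = torch.autograd.grad(loss, params)     # dU/dtheta for each parameter
        with torch.no_grad():
            for p, g in zip(params, grads):
                # Langevin noise with variance 2 * lr * temperature:
                # temperature = 1 is standard SGLD, temperature < 1 gives a "cold" posterior.
                noise = torch.randn_like(p) * (2.0 * lr * temperature) ** 0.5
                p.add_(-lr * g + noise)
        return params

In this sketch, temperature = 1.0 corresponds to sampling from the untempered Bayes posterior; sweeping the temperature below 1 is how the CPE is typically exposed.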

[1] Florian Wenzel, et al. How Good is the Bayes Posterior in Deep Neural Networks Really?, 2020, ICML.

[2] Tianqi Chen, et al. Stochastic Gradient Hamiltonian Monte Carlo, 2014, ICML.

[3] Neil D. Lawrence, et al. Dataset Shift in Machine Learning, 2009.

[4] Andrew Gordon Wilson, et al. The Case for Bayesian Deep Learning, 2020, arXiv.

[5] A. Bhattacharya, et al. Bayesian fractional posteriors, 2016, The Annals of Statistics.

[6] David Barber, et al. Ensemble Learning for Multi-Layer Networks, 1997, NIPS.

[7] Andrew Y. Ng, et al. Reading Digits in Natural Images with Unsupervised Feature Learning, 2011.

[8] Andrew Gordon Wilson, et al. Bayesian Deep Learning and a Probabilistic Perspective of Generalization, 2020, NeurIPS.

[9] Geoffrey E. Hinton, et al. The Helmholtz Machine, 1995, Neural Computation.

[10] Radford M. Neal. Bayesian Learning for Neural Networks, 1995.

[11] Max Welling, et al. The Deep Weight Prior, 2018, ICLR.

[12] Jian Sun, et al. Deep Residual Learning for Image Recognition, 2016, CVPR.

[13] Vincent Fortuin, et al. Bayesian Neural Network Priors Revisited, 2021, arXiv.

[14] Andrew Gordon Wilson, et al. Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning, 2019, ICLR.

[15] David J. C. MacKay. A Practical Bayesian Framework for Backpropagation Networks, 1992, Neural Computation.

[16] Lawrence Carin, et al. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks, 2015, AAAI.

[17] Tianqi Chen, et al. A Complete Recipe for Stochastic Gradient MCMC, 2015, NIPS.

[18] Thijs van Ommen, et al. Inconsistency of Bayesian Inference for Misspecified Linear Models, and a Proposal for Repairing It, 2014, arXiv:1412.3730.

[19] Jasper Snoek, et al. Cold Posteriors and Aleatoric Uncertainty, 2020, arXiv.

[20] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images, 2009.

[21] Sebastian Nowozin, et al. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift, 2019, NeurIPS.

[22] James Hensman, et al. Learning Invariances using the Marginal Likelihood, 2018, NeurIPS.

[23] Pavel Izmailov, et al. What Are Bayesian Neural Network Posteriors Really Like?, 2021, ICML.

[24] Julien Cornebise, et al. Weight Uncertainty in Neural Network, 2015, ICML.

[25] Zhe Gan, et al. Bridging the Gap between Stochastic Gradient MCMC and Stochastic Optimization, 2015, AISTATS.

[26] Alec Radford, et al. Scaling Laws for Neural Language Models, 2020, arXiv.

[27] Ari Pakman, et al. Why Cold Posteriors? On the Suboptimal Generalization of Optimal Bayes Estimates, 2021.

[28] Lennard Jansen, et al. Robust Bayesian inference under model misspecification, 2013.

[29] Thomas L. Griffiths, et al. Human Uncertainty Makes Classification More Robust, 2019, ICCV.

[30] Geoffrey E. Hinton, et al. On the importance of initialization and momentum in deep learning, 2013, ICML.

[31] Kilian Q. Weinberger, et al. On Calibration of Modern Neural Networks, 2017, ICML.

[32] Geoffrey E. Hinton, et al. Keeping the neural networks simple by minimizing the description length of the weights, 1993, COLT '93.

[33] Yee Whye Teh, et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011, ICML.

[34] Peter Grünwald, et al. Safe Learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity, 2011, COLT.

[35] Laurence Aitchison. A statistical theory of cold posteriors in deep neural networks, 2021, ICLR.

[36] Benedict Leimkuhler, et al. Partitioned integrators for thermodynamic parameterization of neural networks, 2019, arXiv.

[37] Jimmy Ba, et al. Adam: A Method for Stochastic Optimization, 2014, ICLR.