On Stein Variational Neural Network Ensembles

Ensembles of deep neural networks have recently achieved great success, but they lack a proper Bayesian justification. Moreover, while they average predictions over several hypotheses, they offer no guarantees on the diversity of their members, which can lead to redundant solutions in function space. In contrast, particle-based inference methods, such as Stein variational gradient descent (SVGD), provide a Bayesian framework, but rely on the choice of a kernel to measure the similarity between ensemble members. In this work, we study different SVGD methods operating in weight space, in function space, and in a hybrid setting. We compare the SVGD approaches to other ensembling-based methods in terms of their theoretical properties and assess their empirical performance on synthetic and real-world tasks. We find that SVGD with functional and hybrid kernels can overcome the limitations of deep ensembles: it improves functional diversity and uncertainty estimation and approaches the true Bayesian posterior more closely. Moreover, we show that using stochastic SVGD updates, as opposed to the standard deterministic ones, can further improve performance.
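
For background, the following is a minimal NumPy sketch of the standard weight-space SVGD update (Liu and Wang, 2016) with an RBF kernel; the function names, the fixed bandwidth, and the full-batch posterior gradient are illustrative simplifications rather than the exact setup studied in the paper. It shows how the kernel couples the ensemble members: the driving term pulls each particle toward high posterior density, while the kernel-gradient term repels nearby particles and enforces diversity.

import numpy as np

def rbf_kernel(particles, bandwidth=1.0):
    # particles: (n, d) array, one row per ensemble member.
    # Pairwise squared distances and RBF kernel matrix K[j, i] = k(theta_j, theta_i).
    diffs = particles[:, None, :] - particles[None, :, :]        # (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)                       # (n, n)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))               # (n, n)
    # Gradient of K[j, i] with respect to particles[j]: shape (n, n, d).
    grad_K = -K[:, :, None] * diffs / bandwidth ** 2
    return K, grad_K

def svgd_step(particles, grad_log_post, step_size=1e-3, bandwidth=1.0):
    # particles: (n, d) ensemble members (e.g. flattened network weights).
    # grad_log_post: (n, d) gradients of log p(theta | D) at each particle.
    n = particles.shape[0]
    K, grad_K = rbf_kernel(particles, bandwidth)
    # phi_i = (1/n) * sum_j [ k(theta_j, theta_i) * grad log p(theta_j | D)
    #                         + grad_{theta_j} k(theta_j, theta_i) ]
    phi = (K @ grad_log_post + grad_K.sum(axis=0)) / n
    return particles + step_size * phi

In the functional and hybrid variants discussed above, the kernel would instead compare (or partly compare) the networks' outputs on a batch of inputs rather than their weights, and the stochastic variants replace this deterministic update with a noisy one.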
