Efficient Low Rank Gaussian Variational Inference for Neural Networks

Bayesian neural networks are enjoying a renaissance driven in part by recent advances in variational inference (VI). The most common form of VI employs a fully factorized, or mean-field, distribution, but this is known to suffer from several pathologies, especially since true posteriors are expected to have highly correlated parameters. Current algorithms that capture these correlations with a Gaussian approximating family are difficult to scale to large models due to computational costs and the high variance of their gradient updates. Using a new form of the reparametrization trick, we derive a computationally efficient algorithm for performing VI with a Gaussian family whose covariance has a low-rank-plus-diagonal structure. The method scales to deep feed-forward and convolutional architectures. We find that adding low-rank terms to a parametrized diagonal covariance does not improve predictive performance except on small networks, whereas low-rank terms added to a constant diagonal covariance improve performance on both small and large-scale architectures.
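
To make the covariance structure concrete, the following is a minimal sketch (in NumPy; the function and variable names are illustrative, not taken from the paper) of a standard reparametrized draw from a Gaussian with covariance diag(sigma^2) + V V^T: two independent noise vectors drive the diagonal and rank-r components, respectively. The paper's own variance-reduced form of the trick may differ in detail.

```python
import numpy as np

def sample_low_rank_gaussian(mu, sigma, V, rng=None):
    """Draw one sample from N(mu, diag(sigma**2) + V @ V.T).

    mu    : (d,)   variational mean
    sigma : (d,)   diagonal standard deviations
    V     : (d, r) low-rank factor with r << d
    """
    rng = rng or np.random.default_rng()
    eps_diag = rng.standard_normal(mu.shape)    # noise driving the diagonal part
    eps_rank = rng.standard_normal(V.shape[1])  # noise driving the rank-r part
    # The sample is an affine function of (mu, sigma, V), so in an autodiff
    # framework ELBO gradients flow through this step (the reparametrization trick).
    return mu + sigma * eps_diag + V @ eps_rank

# Example: sample the weights of a 784 -> 100 layer with a rank-2 correction.
d, r = 784 * 100, 2
mu, sigma = np.zeros(d), 0.1 * np.ones(d)
V = 0.01 * np.random.default_rng(0).standard_normal((d, r))
w = sample_low_rank_gaussian(mu, sigma, V)
```

Because the two noise vectors are independent, the covariance of the draw is exactly diag(sigma^2) + V V^T, and the cost per sample is O(dr) rather than the O(d^2) of a full-covariance Gaussian.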
