SLANG: Fast Structured Covariance Approximations for Bayesian Deep Learning with Natural Gradient

Uncertainty estimation in large deep-learning models is a computationally challenging task in which it is difficult to form even a Gaussian approximation to the posterior distribution. In such situations, existing methods usually resort to a diagonal approximation of the covariance matrix, even though such approximations are known to give poor uncertainty estimates. To address this issue, we propose a new stochastic, low-rank, approximate natural-gradient (SLANG) method for variational inference in large deep models. Our method estimates a “diagonal plus low-rank” covariance structure based solely on back-propagated gradients of the network log-likelihood. This requires strictly fewer gradient computations than methods that compute the gradient of the whole variational objective. Empirical evaluations on standard benchmarks confirm that SLANG enables faster and more accurate estimation of uncertainty than mean-field methods, and performs comparably to state-of-the-art methods.
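
To make the “diagonal plus low-rank” structure concrete, the sketch below shows one way to draw weight samples from a Gaussian whose precision matrix has the form U U^T + diag(d) at O(D L^2) cost, the kind of operation such a posterior approximation relies on. This is an illustrative PyTorch sketch under our own assumptions; the function name sample_diag_plus_lowrank and the SVD-based sampler are hypothetical choices, not the authors' reference implementation.

import torch

def sample_diag_plus_lowrank(mu, U, d, n_samples=1):
    # Draw samples from N(mu, Sigma), where the precision matrix is
    # Sigma^{-1} = U U^T + diag(d), with U of shape (D, L) and L << D.
    # A thin SVD of diag(d)^{-1/2} U keeps the cost at O(D L^2).
    # Hypothetical sketch; the paper's actual sampler may differ in detail.
    d_inv_sqrt = d.rsqrt()                                # diag(d)^{-1/2} stored as a vector
    V = d_inv_sqrt.unsqueeze(1) * U                       # (D, L)
    P, s, _ = torch.linalg.svd(V, full_matrices=False)    # thin SVD: P is (D, L), s is (L,)
    scale = (1.0 + s ** 2).rsqrt() - 1.0                  # correction on the span of P
    eps = torch.randn(n_samples, mu.numel())              # standard normal noise
    corr = (eps @ P) * scale                              # project noise onto the low-rank subspace
    samples = (eps + corr @ P.T) * d_inv_sqrt             # apply a square-root factor of Sigma
    return mu + samples                                   # (n_samples, D)

# Toy usage: five samples from a 1000-dimensional posterior with rank-10 structure.
D, L = 1000, 10
theta = sample_diag_plus_lowrank(torch.zeros(D), torch.randn(D, L), torch.ones(D), n_samples=5)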
