Improving Bayesian Inference in Deep Neural Networks with Variational Structured Dropout

Approximate inference in deep Bayesian networks faces a trade-off between producing high-fidelity posterior approximations and maintaining computational efficiency and scalability. We tackle this challenge by introducing a new structured variational approximation inspired by the interpretation of Dropout training as approximate inference in Bayesian probabilistic models. Concretely, we focus on the restrictive factorized structure of the Dropout posterior, which is too inflexible to capture the rich correlations among weight parameters of the true posterior, and we propose a novel method called Variational Structured Dropout (VSD) to overcome this limitation. VSD employs an orthogonal transformation to learn a structured representation of the variational Dropout noise and consequently induces statistical dependencies in the approximate posterior. We further obtain more expressive Bayesian modeling for VSD by proposing a hierarchical Dropout procedure that corresponds to joint inference in a Bayesian network. Moreover, VSD scales directly to modern deep convolutional networks at low computational cost. Finally, we conduct extensive experiments on standard benchmarks to demonstrate the effectiveness of VSD over state-of-the-art methods in both predictive accuracy and uncertainty estimation.
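
To make the core idea concrete, below is a minimal PyTorch sketch (not the authors' implementation) of a single linear layer with structured multiplicative Dropout noise: factorized Gaussian noise is rotated by a learned Householder reflection, so the induced noise, and hence the implied weight posterior, carries correlations rather than being fully factorized. The class name StructuredDropoutLinear, the single-reflection parameterization, and the initialization constants are illustrative assumptions; the hierarchical Dropout procedure and the KL term of the variational objective described in the paper are omitted here.

import torch
import torch.nn as nn
import torch.nn.functional as F

class StructuredDropoutLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(0.01 * torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Log-variance of the per-unit multiplicative Gaussian dropout noise.
        self.log_alpha = nn.Parameter(torch.full((in_features,), -3.0))
        # Householder vector parameterizing a single learned orthogonal reflection.
        self.v = nn.Parameter(torch.randn(in_features))

    def householder(self, z):
        # Apply H = I - 2 v v^T (with unit-norm v) along the last dimension of z.
        v = self.v / self.v.norm()
        return z - 2.0 * (z @ v).unsqueeze(-1) * v

    def forward(self, x):
        # Factorized noise z ~ N(0, diag(alpha)) is rotated by the orthogonal map H,
        # yielding correlated noise with covariance H diag(alpha) H^T; the noise is
        # then applied multiplicatively to the layer input, as in Gaussian dropout.
        alpha = self.log_alpha.exp()
        z = alpha.sqrt() * torch.randn_like(x)
        xi = 1.0 + self.householder(z)
        return F.linear(x * xi, self.weight, self.bias)

# Usage: layer = StructuredDropoutLinear(784, 256); y = layer(torch.randn(32, 784))

Stacking several Householder reflections would give a richer orthogonal transform at the cost of a few extra vector parameters per layer; the single reflection above is the simplest instance of that design choice.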
