Structured Dropout Variational Inference for Bayesian Neural Networks

Approximate inference in deep Bayesian networks faces a trade-off between producing high-fidelity posterior approximations and maintaining computational efficiency and scalability. We tackle this challenge by introducing a novel structured variational approximation inspired by the Bayesian interpretation of Dropout regularization. Concretely, we focus on the inflexibility of the factorized structure of the Dropout posterior and propose an improved method called Variational Structured Dropout (VSD). VSD employs an orthogonal transformation to learn a structured representation of the variational noise and consequently induces statistical dependencies in the approximate posterior. Theoretically, VSD resolves the pathologies of previous Variational Dropout methods and thus admits a standard Bayesian justification. We further show that VSD induces an adaptive regularization term with several desirable properties that contribute to better generalization. Finally, extensive experiments on standard benchmarks demonstrate the effectiveness of VSD over state-of-the-art variational methods in predictive accuracy, uncertainty estimation, and out-of-distribution detection.
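To make the mechanism concrete, the snippet below is a minimal PyTorch sketch, not the authors' implementation: a linear layer whose multiplicative Gaussian dropout noise is passed through a learned orthogonal map before being applied to the input, so the induced noise (and hence the approximate posterior over activations) is correlated across units rather than fully factorized. The Householder parameterization of the orthogonal map, the class names, the number of reflections, and the initial noise scale are all illustrative assumptions.

```python
# Hedged sketch of structured multiplicative dropout noise via a learned
# orthogonal transform (Householder reflections). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HouseholderOrthogonal(nn.Module):
    """Orthogonal map built as a product of k Householder reflections."""

    def __init__(self, dim: int, k: int = 4):
        super().__init__()
        self.vs = nn.Parameter(torch.randn(k, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Apply H_k ... H_1 x, where H_i = I - 2 v_i v_i^T / ||v_i||^2.
        for v in self.vs:
            v = v / (v.norm() + 1e-8)
            x = x - 2.0 * (x @ v).unsqueeze(-1) * v
        return x


class StructuredDropoutLinear(nn.Module):
    """Linear layer with structured multiplicative Gaussian noise on its input."""

    def __init__(self, in_features: int, out_features: int, init_alpha: float = 0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Per-unit log noise variance (the "dropout rate" alpha), learned jointly.
        self.log_alpha = nn.Parameter(torch.log(torch.full((in_features,), init_alpha)))
        self.orth = HouseholderOrthogonal(in_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training:
            # Deterministic mean prediction at test time.
            return F.linear(x, self.weight, self.bias)
        # Factorized Gaussian noise eps ~ N(0, alpha), correlated by the
        # orthogonal transform; mean-one multiplicative noise is 1 + T(eps).
        eps = torch.randn_like(x) * torch.exp(0.5 * self.log_alpha)
        noise = 1.0 + self.orth(eps)
        return F.linear(x * noise, self.weight, self.bias)
```

A full treatment would additionally include the KL term for the noise parameters in the variational objective (as in standard Variational Dropout) and a local reparameterization for lower-variance gradients; both are omitted here for brevity.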
