Dropout as a Structured Shrinkage Prior

Dropout regularization of deep neural networks has been a mysterious yet effective tool to prevent overfitting. Explanations for its success range from the prevention of "co-adapted" weights to it being a form of cheap Bayesian inference. We propose a novel framework for understanding multiplicative noise in neural networks, considering continuous distributions as well as Bernoulli noise (i.e., dropout). We show that multiplicative noise induces structured shrinkage priors on a network's weights. We derive the equivalence through reparametrization properties of scale mixtures and without invoking any approximations. Given the equivalence, we then show that dropout's Monte Carlo training objective approximates marginal MAP estimation. We leverage these insights to propose a novel shrinkage framework for ResNets, terming the prior 'automatic depth determination' as it is the natural analog of automatic relevance determination for network depth. Lastly, we investigate two inference strategies that improve upon the aforementioned MAP approximation in regression benchmarks.
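As a sketch of the central equivalence (using illustrative notation not fixed by the abstract: weight matrix W, noise vector \boldsymbol{\xi}, and prior scale \sigma), multiplicative noise applied to a layer's inputs can be pushed into the weights,

\mathbf{h} = f\big(\mathbf{W}(\boldsymbol{\xi} \odot \mathbf{x})\big) = f\big(\tilde{\mathbf{W}}\,\mathbf{x}\big), \qquad \tilde{\mathbf{W}} = \mathbf{W}\,\mathrm{diag}(\boldsymbol{\xi}),

so that, assuming an elementwise Gaussian prior w_{jk} \sim \mathcal{N}(0, \sigma^2) on W, marginalizing the noise gives all weights attached to unit k a shared random scale, i.e. a structured Gaussian scale mixture:

p(\tilde{\mathbf{w}}_{\cdot k}) = \int \mathcal{N}\big(\tilde{\mathbf{w}}_{\cdot k};\, \mathbf{0},\, \xi_k^2 \sigma^2 \mathbf{I}\big)\, p(\xi_k)\, d\xi_k .

Bernoulli noise then corresponds to a spike-and-slab-like mixture, while continuous noise distributions yield heavy-tailed shrinkage priors. Under this reading, drawing a fresh noise sample per gradient step and maximizing the per-sample log-likelihood is a one-sample Monte Carlo estimate of \mathbb{E}_{\boldsymbol{\xi}}[\log p(\mathcal{D} \mid \mathbf{W}, \boldsymbol{\xi})], which by Jensen's inequality lower-bounds \log \mathbb{E}_{\boldsymbol{\xi}}[p(\mathcal{D} \mid \mathbf{W}, \boldsymbol{\xi})]; this is the sense in which dropout's training objective approximates MAP estimation in the marginalized (scale-mixture) model.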
