Unifying the Dropout Family Through Structured Shrinkage Priors

Dropout regularization of deep neural networks has been a mysterious yet effective tool for preventing overfitting. Explanations for its success range from preventing "co-adapted" weights to its serving as a cheap form of approximate Bayesian inference. We propose a novel framework for understanding multiplicative noise in neural networks, covering continuous noise distributions as well as Bernoulli noise (i.e. standard dropout). We show that multiplicative noise induces structured shrinkage priors on a network's weights. We derive the equivalence exactly, through reparametrization properties of scale mixtures, rather than via any approximation. Given this equivalence, we then show that dropout's usual Monte Carlo training objective approximates marginal MAP estimation. We analyze this MAP objective under strong shrinkage, showing that the expanded parametrization (i.e. noise in the likelihood) is more stable than the hierarchical representation. Lastly, we derive analogous priors for ResNets, RNNs, and CNNs and reveal their equivalent implementation as noise.
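The reparametrization the abstract appeals to is the classical scale-mixture-of-normals identity; a minimal sketch in illustrative notation (the symbols $\xi$, $\theta$, and $\sigma$ are ours, not the paper's): a weight $\tilde{w} = \xi\,\theta$ built from multiplicative noise $\xi \sim p(\xi)$ and a Gaussian weight $\theta \sim \mathcal{N}(0, \sigma^2)$ has the marginal prior

$$p(\tilde{w}) \;=\; \int \mathcal{N}\!\big(\tilde{w} \,\big|\, 0,\; \xi^2 \sigma^2\big)\, p(\xi)\, d\xi,$$

a scale mixture of normals whose mixing distribution is set by the noise. For instance, Bernoulli noise yields a spike-and-slab-style prior, while continuous noise (e.g. Gaussian) yields heavier-tailed shrinkage priors.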
