Weight Expansion: A New Perspective on Dropout and Generalization

While dropout is known to be a successful regularization technique, insights into the mechanisms that lead to this success are still lacking. We introduce the concept of weight expansion, an increase in the signed volume of a parallelotope spanned by the column or row vectors of the weight covariance matrix, and show that weight expansion is an effective means of improving generalization in a PAC-Bayesian setting. We provide a theoretical argument that dropout leads to weight expansion, together with extensive empirical support for the correlation between dropout and weight expansion. To support our hypothesis that weight expansion can be regarded as an indicator of the enhanced generalization capability endowed by dropout, and not just as a mere by-product, we study other methods that achieve weight expansion (resp. contraction) and find that they generally lead to increased (resp. decreased) generalization ability. This suggests that dropout is an attractive regularizer because it is a computationally cheap method for obtaining weight expansion. This insight justifies the role of dropout as a regularizer, while paving the way for identifying regularizers that promise improved generalization through weight expansion.
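To make the definition concrete, the sketch below estimates such a "weight volume" as the determinant of the correlation matrix (i.e., the normalized covariance matrix) of sampled weight vectors. This is a minimal NumPy illustration, assuming weight vectors are sampled, e.g., across dropout masks or training checkpoints; the function name `weight_volume` and the choice of estimator are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def weight_volume(weight_samples):
    """Estimate an illustrative 'weight volume' from sampled weights.

    weight_samples: array of shape (n_samples, n_weights), each row a
    flattened weight vector drawn, e.g., under a different dropout mask.
    (Hypothetical helper; not the paper's exact estimator.)
    """
    # Covariance of the weights across samples (columns are variables).
    cov = np.cov(weight_samples, rowvar=False)
    # Normalize to a correlation matrix so that the scales of individual
    # weights do not dominate the volume.
    std = np.sqrt(np.diag(cov))
    corr = cov / np.outer(std, std)
    # The volume of the parallelotope spanned by the rows/columns of the
    # normalized covariance matrix is its determinant: values near 1
    # indicate expanded (decorrelated) weights, values near 0 contraction.
    return np.linalg.det(corr)


# Toy usage: 200 samples of a 10-dimensional weight vector.
rng = np.random.default_rng(0)
samples = rng.normal(size=(200, 10))
print(weight_volume(samples))  # close to 1 for nearly independent weights
```

Under this reading, regularizers that push the determinant towards 1 expand the weights, while those that correlate the weights shrink it towards 0.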
