SGD and Weight Decay Provably Induce a Low-Rank Bias in Neural Networks

We study the bias of Stochastic Gradient Descent (SGD) to learn low-rank weight matrices when training deep ReLU neural networks. Our results show that training neural networks with mini-batch SGD and weight decay induces a bias towards rank minimization in the weight matrices. Specifically, we show, both theoretically and empirically, that this bias becomes more pronounced with smaller batch sizes, larger learning rates, or stronger weight decay. We also predict, and confirm empirically, that weight decay is necessary for this bias to emerge. Furthermore, we show that in the presence of intermediate neural collapse, the learned weight matrices are particularly low-rank. Unlike previous work, our analysis does not rely on assumptions about the data, convergence, or optimality of the weight matrices, and it applies to a wide range of neural network architectures of any width or depth. Finally, we empirically investigate the connection between this bias and generalization, finding that it has only a marginal effect on generalization.
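To make the measured quantity concrete, the sketch below trains a small ReLU MLP with mini-batch SGD plus weight decay and reports a thresholded numerical rank of each weight matrix. The architecture, synthetic data, hyperparameters, and rank threshold are illustrative assumptions for this sketch, not the paper's experimental setup; per the abstract, shrinking the batch size or increasing the learning rate or weight decay is the regime where the rank drop is predicted to be strongest.

```python
# Minimal sketch (assumed setup, not the paper's experiments): train a small
# ReLU MLP with mini-batch SGD + weight decay and track the numerical rank of
# each weight matrix over training.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic binary classification data (illustrative assumption).
X = torch.randn(2048, 32)
y = (X[:, 0] * X[:, 1] > 0).long()

model = nn.Sequential(
    nn.Linear(32, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),
)

# Smaller batches, larger learning rate, and stronger weight decay are the
# regimes where the abstract predicts a more pronounced low-rank bias.
opt = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=5e-3)
loss_fn = nn.CrossEntropyLoss()
batch_size = 32


def numerical_rank(W: torch.Tensor, tol: float = 1e-3) -> int:
    """Count singular values above tol times the largest singular value."""
    s = torch.linalg.svdvals(W)  # returned in descending order
    return int((s > tol * s[0]).sum())


for epoch in range(200):
    perm = torch.randperm(X.size(0))
    for i in range(0, X.size(0), batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        opt.step()
    if (epoch + 1) % 50 == 0:
        ranks = [numerical_rank(m.weight.detach())
                 for m in model if isinstance(m, nn.Linear)]
        print(f"epoch {epoch + 1}: per-layer numerical ranks {ranks}")
```

Rerunning with a larger batch size or with weight_decay=0.0 gives a simple way to probe, on this toy setup, the batch-size and weight-decay dependence the abstract describes.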
