SGD Noise and Implicit Low-Rank Bias in Deep Neural Networks

We analyze deep ReLU neural networks trained with mini-batch Stochastic Gradient Descent (SGD) and weight decay. We study the source of SGD noise and prove that, when training with weight decay, the only solutions of SGD at convergence are zero functions. Furthermore, we show, both theoretically and empirically, that when a neural network is trained using SGD with weight decay and a small batch size, the resulting weight matrices are expected to have low rank. Our analysis relies on a minimal set of assumptions: the networks may be arbitrarily wide or deep and may include residual connections as well as batch normalization layers.
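
The sketch below is only an illustration of how the low-rank claim can be probed empirically; it assumes PyTorch, and the architecture, hyperparameters, and the singular-value threshold used to define "effective rank" are illustrative choices rather than the paper's experimental setup.

    # Minimal sketch (illustrative, not the paper's experiments): train a small
    # ReLU network with SGD, weight decay, and a small batch size, then report
    # the effective rank of each weight matrix from its singular values.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # Synthetic regression data; dimensions are arbitrary.
    X = torch.randn(512, 32)
    y = torch.randn(512, 1)

    model = nn.Sequential(
        nn.Linear(32, 128), nn.ReLU(),
        nn.Linear(128, 128), nn.ReLU(),
        nn.Linear(128, 1),
    )

    # Small batch size and nonzero weight decay: the regime studied here.
    opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=5e-4)
    loss_fn = nn.MSELoss()
    batch_size = 8

    for step in range(2000):
        idx = torch.randint(0, X.shape[0], (batch_size,))
        opt.zero_grad()
        loss = loss_fn(model(X[idx]), y[idx])
        loss.backward()
        opt.step()

    def effective_rank(W, tol=1e-3):
        # Count singular values above tol times the largest singular value.
        s = torch.linalg.svdvals(W)
        return int((s > tol * s[0]).sum())

    for name, p in model.named_parameters():
        if p.ndim == 2:  # weight matrices only, skip biases
            print(f"{name}: shape {tuple(p.shape)}, effective rank {effective_rank(p.detach())}")

Running the same script with a larger batch size or with weight_decay=0 gives a simple point of comparison for the batch-size and weight-decay dependence described above.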
