Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks

Stochastic gradient descent with a large initial learning rate is widely used for training modern neural network architectures. Although a small initial learning rate allows for faster training and better test performance initially, the model trained with a large learning rate achieves better generalization soon after the learning rate is annealed. Towards explaining this phenomenon, we devise a setting in which we can prove that a two-layer network trained with a large initial learning rate and subsequent annealing generalizes better than the same network trained with a small learning rate from the start. The key insight in our analysis is that the order in which different types of patterns are learned is crucial: because the small-learning-rate model first memorizes low-noise, hard-to-fit patterns, it generalizes worse on higher-noise, easier-to-fit patterns than its large-learning-rate counterpart. This concept translates to a larger-scale setting: we demonstrate that one can add a small patch to CIFAR-10 images that is immediately memorized by a model trained with a small initial learning rate, but ignored by the model trained with a large learning rate until after annealing. Our experiments show that this causes the small-learning-rate model's accuracy on unmodified images to suffer, as it relies too heavily on the patch early in training.
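
The patch experiment described above can be sketched roughly as follows. This is a minimal, illustrative sketch rather than the paper's actual setup: the patch design (a constant class-dependent square stamped in the corner of every training image), the ResNet-18 architecture, and all hyperparameters (learning rates, annealing epoch, batch size, number of epochs) are assumptions made for illustration, and the paper's exact construction may differ (e.g., stamping only a fraction of the images).

```python
# Hypothetical sketch of the label-leaking patch experiment (assumed hyperparameters).
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T


def add_label_patch(images, labels, patch_size=3):
    """Stamp a small class-dependent patch into the corner of each image,
    so the label is trivially readable from the patch alone."""
    patched = images.clone()
    # Encode the class index as a constant intensity over the patch region.
    intensity = labels.float().view(-1, 1, 1, 1) / 9.0
    patched[:, :, :patch_size, :patch_size] = intensity
    return patched


def train(initial_lr, anneal_epoch, epochs=30):
    transform = T.ToTensor()
    train_set = torchvision.datasets.CIFAR10("./data", train=True, download=True, transform=transform)
    test_set = torchvision.datasets.CIFAR10("./data", train=False, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=256)

    model = torchvision.models.resnet18(num_classes=10)
    opt = torch.optim.SGD(model.parameters(), lr=initial_lr, momentum=0.9)
    # Anneal the learning rate by 10x at `anneal_epoch`.
    sched = torch.optim.lr_scheduler.MultiStepLR(opt, milestones=[anneal_epoch], gamma=0.1)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            # Training images carry the label-leaking patch (here: all of them, a simplification).
            x = add_label_patch(x, y)
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        sched.step()

    # Evaluate on *unmodified* test images: a model that leans on the patch suffers here.
    model.eval()
    correct = 0
    with torch.no_grad():
        for x, y in test_loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
    return correct / len(test_set)


if __name__ == "__main__":
    acc_large = train(initial_lr=0.1, anneal_epoch=20)   # large initial LR, annealed later
    acc_small = train(initial_lr=0.01, anneal_epoch=20)  # small LR from the start
    print(f"clean test accuracy  large-LR: {acc_large:.3f}  small-LR: {acc_small:.3f}")
```

Under the abstract's hypothesis, the small-learning-rate run should fit the patch almost immediately and show weaker accuracy on the clean test images, while the large-learning-rate run should largely ignore the patch until the learning rate is annealed.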
