Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems

The learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li, Wei, and Ma (2019) recently proved that this behavior can occur in a simplified non-convex neural-network setting. In this note, we show that the phenomenon can exist even for convex learning problems, in particular for linear regression in two dimensions. We give a toy convex problem in which learning rate annealing (a large initial learning rate followed by a small learning rate) can lead gradient descent to minima with provably better generalization than using a small learning rate throughout. In our case, this occurs due to a combination of a mismatch between the test and train loss landscapes and early stopping.
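To make the mechanism concrete, here is a minimal numerical sketch. It is not the construction from the note, and every constant below is an illustrative assumption: gradient descent on a 2-D quadratic train loss whose sharp and flat directions are weighted differently by a mismatched test loss, stopped after a fixed step budget.

```python
import numpy as np

# Illustrative sketch (not the paper's exact construction). All curvatures,
# targets, learning rates, and the step budget below are assumptions chosen
# only to demonstrate the qualitative effect described in the abstract.

# Train loss: 0.5 * sum_i a[i] * (w[i] - t[i])^2 -- flat in coord 0, sharp in coord 1.
a = np.array([0.01, 1.0])   # train curvatures
t = np.array([1.0, 1.0])    # train minimizer

# Test loss: 0.5 * sum_i b[i] * (w[i] - s[i])^2 -- mismatched with the train loss:
# it mostly weights the flat train direction, and its minimizer differs in coord 1.
b = np.array([1.0, 0.01])   # test curvatures
s = np.array([1.0, 0.0])    # test minimizer

def train_grad(w):
    return a * (w - t)

def test_loss(w):
    return 0.5 * np.sum(b * (w - s) ** 2)

def run_gd(lrs):
    """Run gradient descent from the origin with a per-step list of learning rates."""
    w = np.zeros(2)
    for lr in lrs:
        w = w - lr * train_grad(w)
    return w

T = 100  # fixed step budget (early stopping: neither run reaches the train minimum)

small_only = [0.1] * T                                 # small learning rate throughout
annealed   = [1.9] * (T // 2) + [0.1] * (T - T // 2)   # large, then small (1.9 < 2/max(a), so stable)

w_small  = run_gd(small_only)
w_anneal = run_gd(annealed)

print(f"small LR only: w = {w_small}, test loss = {test_loss(w_small):.3f}")
print(f"annealed LR  : w = {w_anneal}, test loss = {test_loss(w_anneal):.3f}")
```

Under these assumed constants, the annealed schedule reaches a noticeably lower test loss than the small-rate schedule for the same step budget: the large-rate phase makes much faster progress along the flat train direction, which is exactly the direction the mismatched test loss penalizes most, and early stopping means the small-rate-only run never catches up.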

[1] Kaiming He, et al. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, 2017, arXiv.

[2] Colin Wei, et al. Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks, 2019, NeurIPS.

[3] Alex Krizhevsky, et al. One weird trick for parallelizing convolutional neural networks, 2014, arXiv.

[4] Dario Amodei, et al. An Empirical Model of Large-Batch Training, 2018, arXiv.

[5] Jürgen Schmidhuber, et al. Long Short-Term Memory, 1997, Neural Computation.

[6] Jascha Sohl-Dickstein, et al. The large learning rate phase of deep learning: the catapult mechanism, 2020, arXiv.

[7] Razvan Pascanu, et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[8] Elad Hoffer, et al. Train longer, generalize better: closing the generalization gap in large batch training of neural networks, 2017, NIPS.

[9] Yoshua Bengio, et al. On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length, 2018, ICLR.

[10] Quoc V. Le, et al. Don't Decay the Learning Rate, Increase the Batch Size, 2017, ICLR.

[11] Kyunghyun Cho, et al. The Break-Even Point on Optimization Trajectories of Deep Neural Networks, 2020, ICLR.

[12] Guigang Zhang, et al. Deep Learning, 2016, Int. J. Semantic Comput.

[13] David M. Blei, et al. Stochastic Gradient Descent as Approximate Bayesian Inference, 2017, J. Mach. Learn. Res.

[14] Jorge Nocedal, et al. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, 2016, ICLR.

[15] Yee Whye Teh, et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011, ICML.

[16] Quoc V. Le, et al. A Bayesian Perspective on Generalization and Stochastic Gradient Descent, 2017, ICLR.