Lookahead Optimizer: k steps forward, 1 step back

The vast majority of successful deep neural networks are trained using variants of the stochastic gradient descent (SGD) algorithm. Recent attempts to improve SGD can be broadly categorized into two approaches: (1) adaptive learning rate schemes, such as AdaGrad and Adam, and (2) accelerated schemes, such as heavy-ball and Nesterov momentum. In this paper, we propose a new optimization algorithm, Lookahead, that is orthogonal to these previous approaches and iteratively updates two sets of weights. Intuitively, the algorithm chooses a search direction by looking ahead at the sequence of fast weights generated by another optimizer. We show that Lookahead improves learning stability and lowers the variance of its inner optimizer, with negligible computation and memory cost. We empirically demonstrate that Lookahead can significantly improve the performance of SGD and Adam, even with their default hyperparameter settings, on ImageNet, CIFAR-10/100, neural machine translation, and Penn Treebank.
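
Concretely, the "k steps forward, 1 step back" scheme can be read as: the inner optimizer takes k steps starting from the current slow weights to produce fast weights, and the slow weights are then moved a fraction alpha of the way toward the final fast weights. The snippet below is a minimal sketch of this update loop, assuming plain SGD as the inner optimizer and a toy quadratic objective; the values of k, alpha, and the inner learning rate are illustrative placeholders, not the paper's tuned settings.

```python
# Minimal sketch of the Lookahead update loop on a toy quadratic objective.
# Assumptions: plain SGD as the inner (fast) optimizer; k, alpha, and inner_lr
# are illustrative values chosen for this example only.
import numpy as np

def grad(theta):
    # Gradient of the toy objective 0.5 * ||theta - target||^2.
    target = np.array([1.0, -2.0])
    return theta - target

def lookahead_sgd(theta0, k=5, alpha=0.5, inner_lr=0.1, outer_steps=20):
    slow = theta0.copy()                  # slow weights
    for _ in range(outer_steps):
        fast = slow.copy()                # fast weights start from the slow weights
        for _ in range(k):                # k inner steps with the fast optimizer
            fast -= inner_lr * grad(fast)
        slow += alpha * (fast - slow)     # move slow weights toward the fast weights
    return slow

print(lookahead_sgd(np.zeros(2)))         # converges toward [1.0, -2.0]
```

In this sketch the fast weights are re-initialized from the slow weights at the start of each outer iteration, so the inner optimizer explores ahead while the slow weights follow a smoothed trajectory behind it.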
