Towards Understanding Why Lookahead Generalizes Better Than SGD and Beyond

To train networks, the lookahead algorithm [14] updates its fast weights k times via an inner-loop optimizer before updating its slow weights once using the latest fast weights. Any optimizer, e.g. SGD, can serve as the inner-loop optimizer, and the resulting lookahead variant generally enjoys a remarkable test-performance improvement over the vanilla optimizer. However, a theoretical understanding of this test-performance improvement is still missing. To address this issue, we theoretically justify the advantages of lookahead in terms of the excess risk error, which measures test performance. Specifically, we prove that lookahead with SGD as its inner-loop optimizer can better balance the optimization error and the generalization error, and thus achieves a smaller excess risk error than vanilla SGD on (strongly) convex problems and on nonconvex problems satisfying the Polyak-Łojasiewicz condition, which has been observed/proved to hold in neural networks. Moreover, we show that the stagewise optimization strategy [18], which decays the learning rate several times during training, also benefits lookahead by improving its optimization and generalization errors on strongly convex problems. Finally, we propose a stagewise locally-regularized lookahead (SLRLA) algorithm which, at each stage, minimizes the sum of the vanilla objective and a local regularizer, and provably improves optimization and generalization over the conventional (stagewise) lookahead. Experimental results on CIFAR10/100 and ImageNet testify to its advantages. Code is available at https://github.com/sail-sg/SLRLA-optimizer .
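
The inner-/outer-loop update analyzed above is simple enough to sketch in a few lines. Below is a minimal, illustrative PyTorch sketch of lookahead with SGD as the inner-loop optimizer; the function name lookahead_sgd and the arguments model, loss_fn, data_iter, k, alpha, lr, and outer_steps are placeholders for this sketch and are not the interface of the released SLRLA code.

```python
import torch

def lookahead_sgd(model, loss_fn, data_iter, k=5, alpha=0.5, lr=0.1, outer_steps=100):
    """Sketch of lookahead with an SGD inner loop (hypothetical helper, not the SLRLA API).

    The inner SGD optimizer updates the fast weights k times; the slow weights
    then move toward the latest fast weights by an interpolation factor alpha.
    """
    inner_opt = torch.optim.SGD(model.parameters(), lr=lr)
    # Slow weights start as a detached copy of the model's (fast) weights.
    slow_weights = [p.detach().clone() for p in model.parameters()]

    for _ in range(outer_steps):
        # Inner loop: k fast-weight updates with vanilla SGD.
        for _ in range(k):
            x, y = next(data_iter)          # assumes an iterator yielding (input, target)
            inner_opt.zero_grad()
            loss_fn(model(x), y).backward()
            inner_opt.step()

        # Outer step: slow <- slow + alpha * (fast - slow),
        # then restart the fast weights from the new slow weights.
        with torch.no_grad():
            for slow, fast in zip(slow_weights, model.parameters()):
                slow.add_(alpha * (fast - slow))
                fast.copy_(slow)
    return model
```

Setting alpha = 1 recovers plain SGD run for k * outer_steps steps, which is the baseline the excess-risk comparison above is made against.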

[1] Liang Lin et al. Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition. EMNLP, 2021.

[2] Xiao-Tong Yuan et al. A Theory-Driven Self-Labeling Refinement Method for Contrastive Representation Learning. NeurIPS, 2021.

[3] Sheng Huang et al. Weakly Supervised Patch Label Inference Network with Image Pyramid for Pavement Diseases Recognition in the Wild. ICASSP, 2021.

[4] Mayank Goswami et al. Stability of SGD: Tightness Analysis and Improved Bounds. UAI, 2021.

[5] Hao Li et al. AsymptoticNG: A regularized natural gradient optimization algorithm with look-ahead strategy. arXiv, 2020.

[6] J. Duncan et al. AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients. NeurIPS, 2020.

[7] Pan Zhou et al. Towards Theoretically Understanding Why SGD Generalizes Better Than ADAM in Deep Learning. NeurIPS, 2020.

[8] Pan Zhou et al. Hybrid Stochastic-Deterministic Minibatch Proximal Gradient: Less-Than-Single-Pass Optimization with Nearly Optimal Generalization. ICML, 2020.

[9] R. Socher et al. Theory-Inspired Path-Regularized Differential Network Architecture Search. NeurIPS, 2020.

[10] Junnan Li et al. Prototypical Contrastive Learning of Unsupervised Representations. ICLR, 2020.

[11] Jianyu Wang et al. Lookahead Converges to Stationary Points of Smooth Non-convex Functions. ICASSP, 2020.

[12] Aditya Ganeshan et al. Meta-Learning Extractors for Music Source Separation. ICASSP, 2020.

[13] Liyuan Liu et al. On the Variance of the Adaptive Learning Rate and Beyond. ICLR, 2019.

[14] Geoffrey E. Hinton et al. Lookahead Optimizer: k steps forward, 1 step back. NeurIPS, 2019.

[15] Xu Sun et al. Adaptive Gradient Methods with Dynamic Bound of Learning Rate. ICLR, 2019.

[16] Yang Yuan et al. Asymmetric Valleys: Beyond Sharp and Flat Local Minima. NeurIPS, 2019.

[17] Levent Sagun et al. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. ICML, 2019.

[18] Yan Yan et al. Stagewise Training Accelerates Convergence of Testing Error Over SGD. NeurIPS, 2018.

[19] Pan Zhou et al. Faster First-Order Methods for Stochastic Non-Convex Optimization on Riemannian Manifolds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021.

[20] Jinghui Chen et al. Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks. IJCAI, 2018.

[21] Jiashi Feng et al. Understanding Generalization and Optimization Performance of Deep CNNs. ICML, 2018.

[22] Jiashi Feng et al. Empirical Risk Landscape Analysis for Understanding Deep Neural Networks. ICLR, 2018.

[23] Richard Socher et al. Improving Generalization Performance by Switching from Adam to SGD. arXiv, 2017.

[24] Frank Hutter et al. Decoupled Weight Decay Regularization. ICLR, 2017.

[25] Dimitris S. Papailiopoulos et al. Stability and Generalization of Learning Algorithms that Converge to Global Optima. ICML, 2017.

[26] Yi Zhou et al. Characterization of Gradient Dominance and Regularity Conditions for Neural Networks. arXiv, 2017.

[27] Richard Socher et al. Regularizing and Optimizing LSTM Language Models. ICLR, 2017.

[28] Changjiang Zhang et al. An improved Adam Algorithm using look-ahead. ICDLT, 2017.

[29] Nathan Srebro et al. The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS, 2017.

[30] Tuomas Sandholm et al. Safe and Nested Subgame Solving for Imperfect-Information Games. NIPS, 2017.

[31] Yuanzhi Li et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. NIPS, 2017.

[32] Le Song et al. Diverse Neural Network Learns True Target Functions. AISTATS, 2016.

[33] Tengyu Ma et al. Identity Matters in Deep Learning. ICLR, 2016.

[34] Michael I. Jordan et al. Less than a Single Pass: Stochastically Controlled Stochastic Gradient. AISTATS, 2016.

[35] Nikos Komodakis et al. Wide Residual Networks. BMVC, 2016.

[36] Alexander J. Smola et al. Stochastic Variance Reduction for Nonconvex Optimization. ICML, 2016.

[37] Demis Hassabis et al. Mastering the game of Go with deep neural networks and tree search. Nature, 2016.

[38] Jian Sun et al. Deep Residual Learning for Image Recognition. CVPR, 2016.

[39] Paul W. Fieguth et al. Stage-wise Training: An Improved Feature Learning Strategy for Deep Models. FE@NIPS, 2015.

[40] Yoram Singer et al. Train faster, generalize better: Stability of stochastic gradient descent. ICML, 2015.

[41] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[42] Gerald Penn et al. Convolutional Neural Networks for Speech Recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014.

[43] Dumitru Erhan et al. Going deeper with convolutions. CVPR, 2015.

[44] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition. ICLR, 2014.

[45] Tong Zhang et al. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. NIPS, 2013.

[46] Tara N. Sainath et al. Deep convolutional neural networks for LVCSR. ICASSP, 2013.

[47] Shai Shalev-Shwartz et al. Stochastic dual coordinate ascent methods for regularized loss. Journal of Machine Learning Research, 2012.

[48] Ohad Shamir et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization. ICML, 2011.

[49] Yoram Singer et al. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. Journal of Machine Learning Research, 2011.

[50] Fei-Fei Li et al. ImageNet: A large-scale hierarchical image database. CVPR, 2009.

[51] André Elisseeff et al. Stability and Generalization. Journal of Machine Learning Research, 2002.

[52] H. Robbins. A Stochastic Approximation Method, 1951.

[53] Caiming Xiong et al. Task similarity aware meta learning: theory-inspired improvement on MAML. UAI, 2021.

[54] Nenghai Yu et al. A Simple Baseline for StyleGAN Inversion. arXiv, 2021.

[55] Shuicheng Yan et al. Efficient Meta Learning via Minibatch Proximal Update. NeurIPS, 2019.

[56] Jiashi Feng et al. Efficient Stochastic Gradient Hard Thresholding. NeurIPS, 2018.

[57] Jiashi Feng et al. New Insight into Hybrid Stochastic Gradient Descent: Beyond With-Replacement Sampling and Convexity. NeurIPS, 2018.

[58] Alex Krizhevsky et al. Learning Multiple Layers of Features from Tiny Images, 2009.

[59] Yoshua Bengio et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.