On the Last Iterate Convergence of Momentum Methods

SGD with Momentum (SGDM) is widely used for large-scale optimization of machine learning problems. Yet, the theoretical understanding of this algorithm is incomplete. In fact, even the most recent results require modifications to the algorithm, such as an averaging scheme and a projection onto a bounded domain, that are never used in practice. Moreover, no lower bound is known for SGDM. In this paper, we prove for the first time that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers an error of $\Omega(\log T / \sqrt{T})$ after $T$ steps. Based on this fact, we study a new class of (both adaptive and non-adaptive) Follow-The-Regularized-Leader-based SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate converges at the optimal rate $O(1/\sqrt{T})$ for unconstrained convex optimization problems. Further, we show that in the interpolation setting with convex and smooth functions, our new SGDM algorithm automatically converges at a rate of $O(\log T / T)$. Empirical results are shown as well.
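To make the algorithmic ingredients concrete, below is a minimal sketch of SGDM with an increasing momentum factor and shrinking updates. The specific schedules (`beta_t = t/(t+2)`, `alpha_t = c/sqrt(t)`), the function name `sgdm_increasing_momentum`, and the toy objective are illustrative assumptions, not the exact parameters or algorithm analyzed in the paper.

```python
# Sketch: SGD with momentum, increasing momentum factor, shrinking step sizes.
# All schedules below are assumed for illustration only.
import numpy as np

def sgdm_increasing_momentum(grad_fn, x0, T, c=0.1, rng=None):
    """Run T steps of SGDM with momentum beta_t -> 1 and step size ~ c/sqrt(t).

    grad_fn(x, rng) should return a stochastic (sub)gradient at x.
    Returns the last iterate.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)                      # momentum buffer
    for t in range(1, T + 1):
        g = grad_fn(x, rng)                   # stochastic gradient oracle
        beta_t = t / (t + 2)                  # increasing momentum (assumed schedule)
        alpha_t = c / np.sqrt(t)              # shrinking update (assumed schedule)
        m = beta_t * m + (1 - beta_t) * g     # momentum as a weighted average of gradients
        x = x - alpha_t * m                   # unconstrained update; no projection
    return x

if __name__ == "__main__":
    # Toy example: stochastic subgradients of f(x) = |x - 1| with additive noise.
    grad = lambda x, rng: np.sign(x - 1.0) + 0.1 * rng.standard_normal(x.shape)
    x_last = sgdm_increasing_momentum(grad, x0=np.zeros(1), T=10_000)
    print("last iterate:", x_last)
```

The point of the sketch is only to show the two mechanisms named in the abstract working together: the momentum coefficient grows toward 1 over time while the per-step update shrinks, and the last iterate (not an average) is returned.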
