Almost sure convergence rates for Stochastic Gradient Descent and Stochastic Heavy Ball

We study stochastic gradient descent (SGD) and the stochastic heavy ball method (SHB, also known as the momentum method) for the general stochastic approximation problem. For SGD, in the convex and smooth setting, we provide the first almost sure asymptotic convergence rates for a weighted average of the iterates. More precisely, we show that the convergence rate of the function values is arbitrarily close to o(1/√k), and is exactly o(1/k) in the so-called overparametrized case. We show that these results still hold when using stochastic line search and stochastic Polyak stepsizes, thereby giving the first proof of convergence of these methods in the non-overparametrized regime. Using a substantially different analysis, we show that these rates hold for SHB as well, but at the last iterate. This distinction matters because it is the last iterate of SGD and SHB that is used in practice. We also show that the last iterate of SHB converges to a minimizer almost surely. Additionally, we prove that the function values of the deterministic HB converge at an o(1/k) rate, which is faster than the previously known O(1/k). Finally, in the nonconvex setting, we prove similar rates on the lowest gradient norm along the trajectory of SGD.
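For concreteness, the updates in question take the following standard forms (a sketch using common conventions; the momentum parameter β, the constant c, and the cap γ_b in the stochastic Polyak stepsize are illustrative and need not match the paper's exact formulation):

\[
\textbf{SGD:}\qquad x_{k+1} = x_k - \gamma_k \nabla f_{\xi_k}(x_k),
\]
\[
\textbf{SHB:}\qquad x_{k+1} = x_k - \gamma_k \nabla f_{\xi_k}(x_k) + \beta\,(x_k - x_{k-1}), \qquad \beta \in [0,1),
\]
\[
\textbf{Stochastic Polyak stepsize:}\qquad \gamma_k = \min\!\left\{ \frac{f_{\xi_k}(x_k) - f_{\xi_k}^{\,*}}{c\,\|\nabla f_{\xi_k}(x_k)\|^{2}},\; \gamma_b \right\},
\]

where $f_{\xi_k}$ is the loss on the sample $\xi_k$ drawn at iteration $k$ and $f_{\xi_k}^{\,*}$ its infimum. The o(1/√k) rate for SGD is stated for a (stepsize-weighted) average $\bar{x}_k$ of the iterates $x_0,\dots,x_k$, whereas the SHB rates hold at the last iterate $x_k$ itself.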
