On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes

Stochastic gradient descent is the method of choice for large-scale optimization of machine learning objective functions. Yet, its performance is highly variable and depends heavily on the choice of the stepsizes. This has motivated a large body of research on adaptive stepsizes. However, there is currently a gap in our theoretical understanding of these methods, especially in the non-convex setting. In this paper, we start closing this gap: we theoretically analyze a generalized version of the AdaGrad stepsizes in both the convex and non-convex settings. We give sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting. Moreover, we show that these stepsizes allow the algorithm to automatically adapt to the noise level of the stochastic gradients in both the convex and non-convex settings, interpolating between $O(1/T)$ and $O(1/\sqrt{T})$ rates, up to logarithmic terms.
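To make the setting concrete, here is a minimal sketch of SGD with a global (norm-based) AdaGrad-style stepsize, in which the stepsize shrinks as squared gradient norms accumulate. The exact stepsize form, the exponent eps, and the constants alpha and beta below are illustrative assumptions for this sketch and need not match the paper's precise parameterization of the generalized AdaGrad stepsizes.

    import numpy as np

    def sgd_adagrad_norm(grad_fn, x0, T, alpha=1.0, beta=1.0, eps=0.0):
        """SGD with a global (norm-based) AdaGrad-style stepsize.

        The stepsize at step t is alpha / (beta + sum_{i<=t} ||g_i||^2)^(1/2 + eps).
        With eps = 0 this is the classic AdaGrad-norm stepsize; eps > 0 gives a
        slightly faster-decaying, more conservative variant.
        """
        x = np.asarray(x0, dtype=float).copy()
        grad_sq_sum = 0.0
        for _ in range(T):
            g = grad_fn(x)                          # stochastic gradient at the current iterate
            grad_sq_sum += float(np.dot(g, g))      # accumulate squared gradient norms
            eta = alpha / (beta + grad_sq_sum) ** (0.5 + eps)
            x -= eta * g                            # SGD step with the adaptive stepsize
        return x

    # Usage on a hypothetical toy objective: noisy gradients of f(x) = 0.5 * ||x||^2.
    rng = np.random.default_rng(0)
    noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
    x_final = sgd_adagrad_norm(noisy_grad, x0=np.ones(5), T=10_000)

Because the stepsize is built from the observed gradients themselves, large noisy gradients shrink it quickly while small gradients keep it large, which is the mechanism behind the noise-adaptive rates discussed in the abstract.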
