Convergence of Adam for Non-convex Objectives: Relaxed Hyperparameters and Non-ergodic Case

Adam is one of the most widely used stochastic optimization algorithms in machine learning, yet its convergence is still not fully understood, especially in the non-convex setting. This paper explores which hyperparameter settings guarantee the convergence of vanilla Adam and tackles the non-ergodic convergence questions that matter in practical applications. The primary contributions are as follows. First, we introduce precise definitions of ergodic and non-ergodic convergence, which cover nearly all forms of convergence for stochastic optimization algorithms, and we explain why non-ergodic (last-iterate) convergence is the stronger guarantee. Second, we establish a weaker sufficient condition for the ergodic convergence of Adam, allowing a more relaxed choice of hyperparameters; on this basis, we derive an almost sure ergodic convergence rate for Adam that is arbitrarily close to $o(1/\sqrt{K})$. More importantly, we prove, for the first time, that the last iterate of Adam converges to a stationary point for non-convex objectives. Finally, we obtain a non-ergodic convergence rate of $O(1/K)$ for function values under the Polyak-Łojasiewicz (PL) condition. These results provide a solid theoretical foundation for applying Adam to non-convex stochastic optimization problems.
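For concreteness, here is a minimal sketch, in standard notation, of the vanilla Adam recursion and of the two convergence notions contrasted above; the symbols $\beta_1,\beta_2,\alpha_k,\epsilon,\mu$ and $f^{*}$ follow common usage rather than the paper's exact formulation. Given stochastic gradients $g_k$ of a smooth non-convex objective $f$,
$$
m_k = \beta_1 m_{k-1} + (1-\beta_1)\,g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2)\,g_k^{2}, \qquad
x_{k+1} = x_k - \alpha_k\,\frac{m_k/(1-\beta_1^{k})}{\sqrt{v_k/(1-\beta_2^{k})}+\epsilon},
$$
where the division and square root are taken coordinate-wise. An ergodic guarantee controls an average or a best iterate, e.g.
$$
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla f(x_k)\|^{2}\big]\to 0
\quad\text{or}\quad
\min_{1\le k\le K}\mathbb{E}\big[\|\nabla f(x_k)\|^{2}\big]\to 0,
$$
whereas a non-ergodic (last-iterate) guarantee controls the iterate actually returned,
$$
\lim_{k\to\infty}\|\nabla f(x_k)\| = 0 \quad\text{almost surely}.
$$
The PL condition used for the $O(1/K)$ function-value rate reads
$$
\tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\big(f(x)-f^{*}\big) \quad\text{for all }x,
$$
with $\mu>0$ and $f^{*}=\inf_x f(x)$; it ensures that small gradients imply small optimality gaps without requiring convexity.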
