Convergence of Adam for Non-convex Objectives: Relaxed Hyperparameters and Non-ergodic Case

Adam is one of the most widely used stochastic optimization algorithms in machine learning, yet its convergence is still not fully understood, especially in the non-convex setting. This paper explores which hyperparameter settings guarantee the convergence of vanilla Adam and tackles the non-ergodic convergence questions that matter in practical applications. The primary contributions are as follows. First, we introduce precise definitions of ergodic and non-ergodic convergence, which cover nearly all forms of convergence for stochastic optimization algorithms, and we explain why non-ergodic (last-iterate) convergence is the stronger guarantee. Second, we establish a weaker sufficient condition for the ergodic convergence of Adam, allowing a more relaxed choice of hyperparameters; on this basis, we derive an almost sure ergodic convergence rate for Adam that is arbitrarily close to $o(1/\sqrt{K})$. More importantly, we prove, for the first time, that the last iterate of Adam converges to a stationary point for non-convex objectives. Finally, we obtain a non-ergodic convergence rate of $O(1/K)$ for function values under the Polyak-Łojasiewicz (PL) condition. These results provide a solid theoretical foundation for applying Adam to non-convex stochastic optimization problems.
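For concreteness, here is a minimal sketch, in standard notation, of the vanilla Adam recursion and of the two convergence notions contrasted above; the symbols $\beta_1,\beta_2,\alpha_k,\epsilon,\mu$ and $f^{*}$ follow common usage rather than the paper's exact formulation. Given stochastic gradients $g_k$ of a smooth non-convex objective $f$,
$$
m_k = \beta_1 m_{k-1} + (1-\beta_1)\,g_k, \qquad
v_k = \beta_2 v_{k-1} + (1-\beta_2)\,g_k^{2}, \qquad
x_{k+1} = x_k - \alpha_k\,\frac{m_k/(1-\beta_1^{k})}{\sqrt{v_k/(1-\beta_2^{k})}+\epsilon},
$$
where the division and square root are taken coordinate-wise. An ergodic guarantee controls an average or a best iterate, e.g.
$$
\frac{1}{K}\sum_{k=1}^{K}\mathbb{E}\big[\|\nabla f(x_k)\|^{2}\big]\to 0
\quad\text{or}\quad
\min_{1\le k\le K}\mathbb{E}\big[\|\nabla f(x_k)\|^{2}\big]\to 0,
$$
whereas a non-ergodic (last-iterate) guarantee controls the iterate actually returned,
$$
\lim_{k\to\infty}\|\nabla f(x_k)\| = 0 \quad\text{almost surely}.
$$
The PL condition used for the $O(1/K)$ function-value rate reads
$$
\tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\big(f(x)-f^{*}\big) \quad\text{for all }x,
$$
with $\mu>0$ and $f^{*}=\inf_x f(x)$; it ensures that small gradients imply small optimality gaps without requiring convexity.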
