A Simple Convergence Proof of Adam and Adagrad

We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound that is explicit in the constants of the problem, the parameters of the optimizer, and the total number of iterations $N$. This bound can be made arbitrarily small: Adam with a learning rate $\alpha = 1/\sqrt{N}$ and a momentum parameter on squared gradients $\beta_2 = 1 - 1/N$ achieves the same rate of convergence $O(\ln(N)/\sqrt{N})$ as Adagrad. Finally, we obtain the tightest dependency on the heavy-ball momentum parameter $\beta_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving it from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$. Our technique also improves the best known dependency for standard SGD by a factor of $1 - \beta_1$.

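To make the hyperparameter schedule concrete, below is a minimal sketch of the standard Adam update with the choices highlighted in the abstract, $\alpha = 1/\sqrt{N}$ and $\beta_2 = 1 - 1/N$, run on a toy smooth non-convex objective with bounded gradients while tracking the squared gradient norm averaged over the trajectory (the quantity the bound controls). The toy objective, the function names, and the use of the usual bias-corrected Adam update are illustrative assumptions, not the paper's exact setting or experiments.

```python
import numpy as np

def adam_average_grad_norm(grad, x0, N, beta1=0.9, eps=1e-8):
    """Run Adam for N steps with alpha = 1/sqrt(N) and beta2 = 1 - 1/N,
    and return the squared gradient norm averaged over the trajectory."""
    alpha = 1.0 / np.sqrt(N)      # learning rate alpha = 1/sqrt(N)
    beta2 = 1.0 - 1.0 / N         # squared-gradient momentum beta2 = 1 - 1/N
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)          # first moment (heavy-ball momentum)
    v = np.zeros_like(x)          # second moment (squared gradients)
    total = 0.0
    for n in range(1, N + 1):
        g = grad(x)
        total += float(np.dot(g, g))
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** n)   # standard bias corrections
        v_hat = v / (1 - beta2 ** n)
        x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return total / N

if __name__ == "__main__":
    # Toy smooth objective with bounded gradients: f(x) = sum(cos(x_i)),
    # so grad f(x) = -sin(x), with each coordinate bounded by 1.
    grad_f = lambda x: -np.sin(x)
    for N in (100, 1000, 10000):
        avg = adam_average_grad_norm(grad_f, x0=np.full(5, 2.0), N=N)
        print(f"N={N:6d}  average squared gradient norm ~ {avg:.4f}")
```

As $N$ grows, the averaged squared gradient norm reported by this sketch should shrink, consistent with the $O(\ln(N)/\sqrt{N})$ rate stated above, since both the learning rate and $1 - \beta_2$ are scaled down with the horizon $N$.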