On Stochastic Moving-Average Estimators for Non-Convex Optimization

In this paper, we consider the widely used but not fully understood stochastic estimator based on moving average (SEMA), which only requires a general unbiased stochastic oracle. We demonstrate the power of SEMA on a range of stochastic non-convex optimization problems. In particular, we analyze various stochastic methods (existing or newly proposed) based on the variance recursion property of SEMA for three families of non-convex optimization, namely standard stochastic non-convex minimization, stochastic non-convex strongly-concave min-max optimization, and stochastic bilevel optimization. Our contributions include: (i) for standard stochastic non-convex minimization, we present a simple and intuitive proof of convergence for a family of Adam-style methods (including Adam, AMSGrad, AdaBound, etc.) with an increasing or large "momentum" parameter for the first-order moment, which gives an alternative and more natural way to guarantee that Adam converges; (ii) for stochastic non-convex strongly-concave min-max optimization, we present single-loop primal-dual stochastic momentum and adaptive methods based on the moving average estimators and establish their oracle complexity of O(1/ε^4) without using a large mini-batch size, addressing a gap in the literature; (iii) for stochastic bilevel optimization, we present a single-loop stochastic method based on the moving average estimators and establish its oracle complexity of Õ(1/ε^4) without computing the SVD of the Hessian matrix, improving state-of-the-art results. For all these problems, we also establish a variance-diminishing result for the stochastic gradient estimators that are used.
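To make the SEMA recursion concrete, below is a minimal Python/NumPy sketch of plugging a moving-average gradient estimate z_t = (1 - β) z_{t-1} + β g_t, computed from an unbiased stochastic gradient oracle, into an Adam-style update. The function name sema_adam_step, the fixed β in place of a schedule, and the toy quadratic objective are illustrative assumptions, not the paper's algorithm or notation; in Adam's usual parameterization the weight β on the fresh gradient corresponds to 1 - β1, so a large or increasing first-moment "momentum" parameter corresponds to a small or decreasing β here.

import numpy as np

def sema_adam_step(x, z, v, grad_fn, lr=1e-3, beta=0.1, beta2=0.999, eps=1e-8):
    # grad_fn: unbiased stochastic gradient oracle, grad_fn(x) -> array like x
    g = grad_fn(x)
    z = (1.0 - beta) * z + beta * g          # SEMA: moving-average gradient estimate
    v = beta2 * v + (1.0 - beta2) * g * g    # Adam-style second-moment estimate
    x = x - lr * z / (np.sqrt(v) + eps)      # adaptive step driven by the SEMA estimate
    return x, z, v

# Toy usage (illustrative): minimize E[||x - target||^2 / 2] with noisy gradients.
rng = np.random.default_rng(0)
target = np.ones(5)
x = np.zeros(5); z = np.zeros(5); v = np.zeros(5)
noisy_grad = lambda x: (x - target) + 0.1 * rng.standard_normal(x.shape)
for t in range(2000):
    x, z, v = sema_adam_step(x, z, v, noisy_grad, lr=1e-2)

The variance-recursion property referenced above concerns the error of z relative to the true gradient: with a suitably chosen β, the estimation error of z contracts from step to step, which is the mechanism the analyses for all three problem families exploit.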
