Linear Convergence of Adaptive Stochastic Gradient Descent

We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions satisfying the Polyak-Łojasiewicz (PL) inequality. The paper introduces the notion of a Restricted Uniform Inequality of Gradients (RUIG)---a measure of how balanced the stochastic gradient norms are---to characterize the landscape of a function. RUIG plays a key role in proving that AdaGrad-Norm is robust to the tuning of its hyper-parameters in the stochastic setting. Building on RUIG, we develop a two-stage framework that establishes the linear convergence of AdaGrad-Norm without knowledge of the parameters of the objective function. This framework can likely be extended to other adaptive stepsize algorithms. Numerical experiments validate the theory and suggest future directions for improvement.

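For concreteness, below is a minimal sketch of the AdaGrad-Norm update analyzed in the paper: a single scalar stepsize eta / b_t, where b_t^2 accumulates the squared norms of all past stochastic gradients. The hyper-parameter values (eta, b0) and the toy least-squares problem are illustrative assumptions, not settings prescribed by the paper.

```python
import numpy as np

def adagrad_norm(grad_fn, x0, eta=1.0, b0=0.1, num_iters=1000, rng=None):
    """Minimal sketch of AdaGrad-Norm (the norm version of AdaGrad).

    grad_fn(x, rng) should return a stochastic gradient at x.
    eta and b0 are illustrative hyper-parameter choices.
    """
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2                      # running sum of squared gradient norms
    for _ in range(num_iters):
        g = grad_fn(x, rng)             # stochastic gradient at the current iterate
        b_sq += np.dot(g, g)            # accumulate the squared gradient norm
        x -= (eta / np.sqrt(b_sq)) * g  # single adaptive scalar stepsize
    return x

# Toy usage: an over-parametrization-free least-squares problem with
# single-sample stochastic gradients (purely for illustration).
if __name__ == "__main__":
    A = np.random.default_rng(1).standard_normal((100, 5))
    x_star = np.ones(5)
    y = A @ x_star

    def grad_fn(x, rng):
        i = rng.integers(len(y))
        return (A[i] @ x - y[i]) * A[i]  # gradient of 0.5 * (a_i^T x - y_i)^2

    x_hat = adagrad_norm(grad_fn, np.zeros(5), num_iters=5000)
    print(np.linalg.norm(x_hat - x_star))
```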