Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent

Nesterov's accelerated gradient descent (AGD), an instance of the general family of "momentum methods", provably achieves a faster convergence rate than gradient descent (GD) in the convex setting. However, whether these methods are superior to GD in the nonconvex setting remains an open question. This paper studies a simple variant of AGD and shows that it escapes saddle points and finds a second-order stationary point in $\tilde{O}(1/\epsilon^{7/4})$ iterations, faster than the $\tilde{O}(1/\epsilon^{2})$ iterations required by GD. To the best of our knowledge, this is the first Hessian-free algorithm to find a second-order stationary point faster than GD, and also the first single-loop algorithm with a faster rate than GD even in the setting of finding a first-order stationary point. Our analysis rests on two key ideas: (1) the use of a simple Hamiltonian function, inspired by a continuous-time perspective, which AGD decreases monotonically at each step even for nonconvex functions, and (2) a novel framework called "improve or localize", which is useful for tracking the long-term behavior of gradient-based optimization algorithms. We believe that these techniques may deepen our understanding of both acceleration algorithms and nonconvex optimization.
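
To make the first idea concrete, the following NumPy sketch runs a basic Nesterov-style iteration while recording the Hamiltonian $E_t = f(x_t) + \frac{1}{2\eta}\|v_t\|^2$ (with momentum $v_t = x_t - x_{t-1}$) that the analysis tracks. This is only an illustration under assumed choices, not the paper's full algorithm: the function name `agd_with_hamiltonian`, the toy objective, and the step-size and momentum values are invented for the example, and the perturbation and negative-curvature safeguards of the actual variant are omitted.

```python
import numpy as np

def agd_with_hamiltonian(f, grad_f, x0, eta=0.05, theta=0.1, num_iters=2000):
    """Nesterov-style AGD iteration that also records the Hamiltonian
    E_t = f(x_t) + ||v_t||^2 / (2 * eta), with momentum v_t = x_t - x_{t-1}.

    Illustrative sketch only: the paper's variant additionally perturbs the
    iterate when the gradient is small and exploits negative curvature when
    needed; those safeguards are omitted here.
    """
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)                      # v_0 = 0
    energies = []
    for _ in range(num_iters):
        y = x + (1.0 - theta) * v             # look-ahead (momentum) point
        x_next = y - eta * grad_f(y)          # gradient step taken from y
        v = x_next - x                        # new momentum v_{t+1}
        x = x_next
        energies.append(f(x) + v @ v / (2.0 * eta))
    return x, energies

if __name__ == "__main__":
    # A toy nonconvex function with a saddle point at the origin
    # (an arbitrary choice for illustration, not taken from the paper).
    f = lambda z: 0.25 * z[0] ** 4 - 0.5 * z[0] ** 2 + 0.5 * z[1] ** 2
    grad_f = lambda z: np.array([z[0] ** 3 - z[0], z[1]])

    x_final, energies = agd_with_hamiltonian(f, grad_f, x0=[1e-2, 1e-2])
    print("final iterate:", x_final)          # approaches the minimum near (1, 0)
    print("Hamiltonian decreased:", energies[0] >= energies[-1])
```

In this toy run the iterate slides off the saddle at the origin toward a minimizer, and the recorded Hamiltonian is what the paper's analysis uses as a potential function to measure progress.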
