"Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

We develop and analyze a variant of Nesterov's accelerated gradient descent (AGD) for minimization of smooth non-convex functions. We prove that one of two cases occurs: either our AGD variant converges quickly, as if the function were convex, or we produce a certificate that the function is "guilty" of being non-convex. This non-convexity certificate allows us to exploit negative curvature and obtain deterministic, dimension-free acceleration of convergence for non-convex functions. For a function $f$ with Lipschitz continuous gradient and Hessian, we compute a point $x$ with $\|\nabla f(x)\| \le \epsilon$ in $O(\epsilon^{-7/4} \log(1/\epsilon))$ gradient and function evaluations. Assuming additionally that the third derivative is Lipschitz, we require only $O(\epsilon^{-5/3} \log(1/\epsilon))$ evaluations.
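
To make the dichotomy concrete, the following is a minimal sketch (not the paper's algorithm) of the "convex until proven guilty" principle: run a standard Nesterov-style AGD loop and, at each step, test the first-order convexity inequality $f(u) \ge f(w) + \nabla f(w)^\top (u - w)$ on a pair of iterates. If the inequality ever fails, that pair certifies non-convexity of $f$; otherwise the iteration proceeds as ordinary AGD. The step size $1/L$, the momentum schedule, and the particular pair of points tested are illustrative assumptions, and the sketch omits how the certificate is then used to exploit negative curvature.

```python
# Illustrative sketch of the "convex until proven guilty" idea (not the
# paper's exact method): run AGD and check a first-order convexity
# inequality at every iterate; a violation certifies non-convexity.
import numpy as np

def agd_until_proven_guilty(f, grad, x0, L, max_iter=1000, tol=1e-6):
    """Run AGD on f; stop when ||grad|| <= tol or non-convexity is certified.

    Returns (x, certificate), where certificate is None if no violation of
    convexity was observed, and otherwise a pair (u, w) with
    f(u) < f(w) + <grad(w), u - w>, proving f is not convex.
    """
    x, y = x0.copy(), x0.copy()
    t = 1.0
    for _ in range(max_iter):
        g = grad(y)
        if np.linalg.norm(g) <= tol:
            return y, None
        x_new = y - g / L                                   # gradient step from extrapolated point
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))    # standard momentum schedule (assumption)
        y_new = x_new + ((t - 1.0) / t_new) * (x_new - x)   # Nesterov extrapolation

        # Convexity would force f(x) >= f(x_new) + <grad(x_new), x - x_new>.
        # A strict violation certifies that f is non-convex on the segment.
        if f(x) < f(x_new) + grad(x_new) @ (x - x_new):
            return x_new, (x, x_new)                        # "guilty": non-convexity certificate

        x, y, t = x_new, y_new, t_new
    return x, None
```

A violating pair $(u, w)$ also implies that the Hessian has negative curvature somewhere on the segment between $u$ and $w$ (by Taylor's theorem with integral remainder), which is the curvature information the full method exploits, as described in the abstract above.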
