Paper information - "Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

We develop and analyze a variant of Nesterov's accelerated gradient descent (AGD) for minimization of smooth non-convex functions. We prove that one of two cases occurs: either our AGD variant converges quickly, as if the function were convex, or we produce a certificate that the function is "guilty" of being non-convex. This non-convexity certificate allows us to exploit negative curvature and obtain deterministic, dimension-free acceleration of convergence for non-convex functions. For a function $f$ with Lipschitz continuous gradient and Hessian, we compute a point $x$ with $\|\nabla f(x)\| \le \epsilon$ in $O(\epsilon^{-7/4} \log(1/\epsilon))$ gradient and function evaluations. Assuming additionally that the third derivative is Lipschitz, we require only $O(\epsilon^{-5/3} \log(1/\epsilon))$ evaluations.
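
The "converge or certify" dichotomy in the abstract can be illustrated with a short sketch. This is a simplified illustration under stated assumptions, not the paper's algorithm: it runs a textbook Nesterov iteration with momentum $(t-1)/(t+2)$ and step size $1/L$, and after each step tests the first-order convexity inequality $f(u) \ge f(v) + \nabla f(v)^\top (u - v)$ on a few pairs of iterates. A violated inequality is returned as a pair $(u, v)$ proving that $f$ is not convex. The names `agd_with_convexity_check` and `convexity_violated`, the choice of pairs to test, and the tolerance `tol` are all hypothetical, and the step that would use the certificate to exploit negative curvature is omitted.

```python
import math
from typing import Callable, List, Optional, Tuple

Vector = List[float]
Pair = Tuple[Vector, Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(ai * bi for ai, bi in zip(a, b))


def convexity_violated(f: Callable[[Vector], float],
                       grad: Callable[[Vector], Vector],
                       u: Vector, v: Vector, tol: float = 1e-9) -> bool:
    """True when f(u) < f(v) + <grad f(v), u - v>, which no convex f allows."""
    step = [ui - vi for ui, vi in zip(u, v)]
    return f(u) < f(v) + dot(grad(v), step) - tol


def agd_with_convexity_check(f: Callable[[Vector], float],
                             grad: Callable[[Vector], Vector],
                             x0: Vector, lipschitz: float, eps: float,
                             max_iters: int = 1000
                             ) -> Tuple[Vector, Optional[Pair]]:
    """Illustrative sketch only (hypothetical helper, not the paper's method).

    Returns (x, None) once the gradient norm is at most eps or the
    iteration budget runs out, and (x, (u, v)) as soon as some pair of
    iterates violates the convexity inequality.
    """
    x_prev, x = list(x0), list(x0)
    for t in range(1, max_iters + 1):
        g = grad(x)
        if math.sqrt(dot(g, g)) <= eps:
            return x, None
        momentum = (t - 1) / (t + 2)
        # Extrapolated point, then a gradient step of size 1/L from it.
        y = [xi + momentum * (xi - pi) for xi, pi in zip(x, x_prev)]
        x_next = [yi - gi / lipschitz for yi, gi in zip(y, grad(y))]
        # Assumed choice of pairs to test; any violation is a certificate.
        for u, v in ((x, y), (x_next, y), (y, x)):
            if convexity_violated(f, grad, u, v):
                return x_next, (u, v)
        x_prev, x = x, x_next
    return x, None
```

On a convex quadratic such as $f(x) = \tfrac12 (x_1^2 + 10 x_2^2)$ with `lipschitz=10.0`, the sketch returns no certificate. On $f(x) = \cos(x)$ started at $x = 0.1$ with `lipschitz=1.0`, the first step already produces a violating pair, since the function is concave near the origin.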
