SVRG Escapes Saddle Points

In this paper, we consider a fundamental problem of non-convex optimization that has wide applications and implications in machine learning. Previous works have shown that stochastic gradient descent with the variance reduction technique (SVRG) converges to a first-order stationary point in O(n^{2/3} / ε^2) iterations in non-convex settings. However, many problems in non-convex optimization require an efficient way to find second-order stationary points, which are a subset of first-order stationary points, and therefore algorithms that converge only to a first-order stationary point are not satisfactory. This paper shows how SVRG converges to a second-order stationary point in the non-convex setting when augmented with a uniformly random perturbation. To find an ε-second-order stationary point, it takes O(n^{3/4} · polylog(d n / ε) / ε^2) iterations. Compared with the convergence rate of SVRG to a first-order stationary point, the loss in convergence rate depends only polylogarithmically on the dimension d and involves a small polynomial factor in n, namely O(n^{1/12}). This is the best known result for finding a second-order stationary point. We also give intuitions and proof sketches for a new analysis framework based on the stable manifold theorem; this new framework may help eliminate the n^{1/12} loss from our current analysis. Our results can be directly applied to problems such as the training of neural networks and tensor decomposition. The stable-manifold intuition may be of independent interest to the non-convex optimization community.
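The algorithm described above pairs standard SVRG epochs with a uniformly random perturbation that is injected when the full gradient becomes small. The following is a minimal illustrative sketch of that idea, not the paper's exact procedure: the function name perturbed_svrg, the step size eta, the perturbation radius, the gradient-norm trigger, and the epoch/inner-loop lengths are all placeholder assumptions chosen for readability.

```python
import numpy as np

def perturbed_svrg(grad_i, x0, n, eta=0.01, epochs=50, inner_steps=None,
                   grad_tol=1e-3, perturb_radius=1e-3, rng=None):
    """Sketch of SVRG with a uniformly random perturbation (illustrative only).

    grad_i(x, i) -- stochastic gradient of the i-th component function at x
    n            -- number of component functions in the finite sum
    """
    rng = np.random.default_rng() if rng is None else rng
    inner_steps = n if inner_steps is None else inner_steps
    x_snap = np.asarray(x0, dtype=float).copy()
    d = x_snap.size

    for _ in range(epochs):
        # Snapshot (full) gradient, as in standard SVRG.
        mu = np.mean([grad_i(x_snap, i) for i in range(n)], axis=0)

        # A small full gradient indicates an approximate first-order stationary
        # point; add a perturbation drawn uniformly from a ball of radius
        # `perturb_radius` so strict saddle points can be escaped.
        if np.linalg.norm(mu) <= grad_tol:
            direction = rng.normal(size=d)
            direction /= np.linalg.norm(direction)
            radius = perturb_radius * rng.random() ** (1.0 / d)
            x_snap = x_snap + radius * direction
            mu = np.mean([grad_i(x_snap, i) for i in range(n)], axis=0)

        # Inner loop with variance-reduced stochastic gradients.
        x = x_snap.copy()
        for _ in range(inner_steps):
            i = int(rng.integers(n))
            v = grad_i(x, i) - grad_i(x_snap, i) + mu
            x = x - eta * v
        x_snap = x

    return x_snap
```

As a usage illustration, grad_i could be the per-sample gradient of a least-squares loss, e.g. grad_i = lambda x, i: A[i] * (A[i] @ x - b[i]) for data rows A and targets b.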
