Gradient Descent Can Take Exponential Time to Escape Saddle Points

Although gradient descent (GD) almost always escapes saddle points asymptotically [Lee et al., 2016], this paper shows that even with fairly natural random initialization schemes and non-pathological functions, GD can be significantly slowed down by saddle points, taking exponential time to escape. On the other hand, gradient descent with perturbations [Ge et al., 2015; Jin et al., 2017] is not slowed down by saddle points; it can find an approximate local minimizer in polynomial time. This result implies that GD is inherently slower than perturbed GD, and it justifies the importance of adding perturbations for efficient non-convex optimization. While our focus is theoretical, we also present experiments that illustrate these findings.
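For intuition, here is a minimal, hypothetical sketch contrasting plain GD with perturbed GD near a single quadratic saddle f(x, y) = x²/2 − y²/2. This is not the paper's construction (which chains together many saddle points to obtain an exponential-time lower bound); it only illustrates the mechanism. The stable manifold of the saddle is the x-axis, so an initialization exponentially close to it costs plain GD proportionally many iterations to escape, while a small random perturbation makes the escape time depend only on the noise scale. The names (grad, run) and parameters (eta, sigma, escape_radius) are illustrative choices, not notation from the paper.

```python
import numpy as np

# Toy saddle (not the paper's construction): f(x, y) = 0.5*x**2 - 0.5*y**2.
# The stable manifold is the x-axis; plain GD started exponentially close to it
# lingers near the saddle, while a small random perturbation (in the spirit of
# perturbed GD [Ge et al., 2015; Jin et al., 2017]) escapes much sooner.

def grad(z):
    x, y = z
    return np.array([x, -y])          # gradient of 0.5*x^2 - 0.5*y^2

def run(z0, eta=0.1, sigma=0.0, max_iter=100_000, escape_radius=1.0, seed=0):
    """Gradient descent with optional Gaussian perturbation of scale sigma.
    Returns the number of iterations until |y| exceeds escape_radius."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    for t in range(max_iter):
        if abs(z[1]) > escape_radius:  # left the neighborhood of the saddle
            return t
        z = z - eta * grad(z) + sigma * rng.standard_normal(2)
    return max_iter                    # did not escape within the budget

z0 = (1.0, 1e-12)                      # almost exactly on the stable manifold
print("plain GD escape iterations:    ", run(z0, sigma=0.0))
print("perturbed GD escape iterations:", run(z0, sigma=1e-3))
```

In this toy setting, plain GD needs roughly log(1/|y0|)/log(1 + eta) iterations to escape, so halving the distance to the stable manifold keeps adding iterations, whereas the perturbed run escapes after a number of iterations governed by sigma rather than by the initialization.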

[1] H. Whitney. Analytic Extensions of Differentiable Functions Defined in Closed Sets, 1934.

[2] J. Palis, et al. Geometric theory of dynamical systems: an introduction, 1984.

[3] A. Edelman, et al. Nonnegativity-, monotonicity-, or convexity-preserving cubic and quintic Hermite interpolation, 1989.

[4] R. Pemantle, et al. Nonconvergence to Unstable Points in Urn Models and Stochastic Approximations, 1990.

[5] H. Kushner, et al. Stochastic Approximation and Recursive Algorithms and Applications, 2003.

[6] Yurii Nesterov, et al. Introductory Lectures on Convex Optimization: A Basic Course, 2014, Applied Optimization.

[7] Yurii Nesterov, et al. Cubic regularization of Newton method and its global performance, 2006, Math. Program.

[8] Moritz Hardt, et al. Understanding Alternating Minimization for Matrix Completion, 2013, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.

[9] Alan J. Chang. The Whitney extension theorem in high dimensions, 2015, arXiv:1508.01779.

[10] Zhi-Quan Luo, et al. Guaranteed Matrix Completion via Non-Convex Factorization, 2014, IEEE Transactions on Information Theory.

[11] Prateek Jain, et al. Phase Retrieval Using Alternating Minimization, 2013, IEEE Transactions on Signal Processing.

[12] Furong Huang, et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, 2015, COLT.

[13] Xiaodong Li, et al. Phase Retrieval via Wirtinger Flow: Theory and Algorithms, 2014, IEEE Transactions on Information Theory.

[14] Nathan Srebro, et al. Tight Complexity Bounds for Optimizing Composite Objectives, 2016, NIPS.

[15] John Wright, et al. A Geometric Analysis of Phase Retrieval, 2016, International Symposium on Information Theory.

[16] Nathan Srebro, et al. Global Optimality of Local Search for Low Rank Matrix Recovery, 2016, NIPS.

[17] Yair Carmon, et al. Accelerated Methods for Non-Convex Optimization, 2016, SIAM J. Optim.

[18] Kfir Y. Levy, et al. The Power of Normalization: Faster Evasion of Saddle Points, 2016, ArXiv.

[19] Yair Carmon, et al. Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step, 2016, ArXiv.

[20] Michael I. Jordan, et al. Gradient Descent Converges to Minimizers, 2016, ArXiv.

[21] Constantine Caramanis, et al. Fast Algorithms for Robust PCA via Gradient Descent, 2016, NIPS.

[22] Michael I. Jordan, et al. Gradient Descent Only Converges to Minimizers, 2016, COLT.

[23] Tengyu Ma, et al. Matrix Completion has No Spurious Local Minimum, 2016, NIPS.

[24] Elad Hazan, et al. Finding Local Minima for Nonconvex Optimization in Linear Time, 2016.

[25] Junwei Lu, et al. Symmetry, Saddle Points, and Global Geometry of Nonconvex Matrix Factorization, 2016, ArXiv.

[26] Yi Zheng, et al. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis, 2017, ICML.

[27] Xiao Zhang, et al. Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements, 2017, arXiv:1701.00481.

[28] John Wright, et al. Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture, 2015, IEEE Transactions on Information Theory.

[29] Daniel P. Robinson, et al. A trust region algorithm with a worst-case iteration complexity of O(ϵ^{-3/2}), 2016, Mathematical Programming.

[30] Prateek Jain, et al. Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot, 2015, AISTATS.

[31] Anastasios Kyrillidis, et al. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach, 2016, AISTATS.

[32] Michael I. Jordan, et al. How to Escape Saddle Points Efficiently, 2017, ICML.

[33] Tengyu Ma, et al. Finding approximate local minima faster than gradient descent, 2016, STOC.

[34] Junwei Lu, et al. Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization, 2016, 2018 Information Theory and Applications Workshop (ITA).

[35] Yuandong Tian, et al. When is a Convolutional Filter Easy To Learn?, 2017, ICLR.