Gradient Descent Can Take Exponential Time to Escape Saddle Points
Michael I. Jordan | Barnabás Póczos | Chi Jin | Jason D. Lee | Simon S. Du | Aarti Singh
[1] H. Whitney. Analytic Extensions of Differentiable Functions Defined in Closed Sets, 1934.
[2] J. Palis, et al. Geometric theory of dynamical systems: an introduction, 1984.
[3] A. Edelman, et al. Nonnegativity-, monotonicity-, or convexity-preserving cubic and quintic Hermite interpolation, 1989.
[4] R. Pemantle, et al. Nonconvergence to Unstable Points in Urn Models and Stochastic Approximations, 1990.
[5] H. Kushner, et al. Stochastic Approximation and Recursive Algorithms and Applications, 2003.
[6] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, 2004, Applied Optimization.
[7] Yurii Nesterov, et al. Cubic regularization of Newton method and its global performance, 2006, Math. Program.
[8] Moritz Hardt, et al. Understanding Alternating Minimization for Matrix Completion, 2013, 2014 IEEE 55th Annual Symposium on Foundations of Computer Science.
[9] Alan J. Chang. The Whitney extension theorem in high dimensions, 2015, ArXiv, 1508.01779.
[10] Zhi-Quan Luo, et al. Guaranteed Matrix Completion via Non-Convex Factorization, 2014, IEEE Transactions on Information Theory.
[11] Prateek Jain, et al. Phase Retrieval Using Alternating Minimization, 2013, IEEE Transactions on Signal Processing.
[12] Furong Huang, et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, 2015, COLT.
[13] Xiaodong Li, et al. Phase Retrieval via Wirtinger Flow: Theory and Algorithms, 2014, IEEE Transactions on Information Theory.
[14] Nathan Srebro, et al. Tight Complexity Bounds for Optimizing Composite Objectives, 2016, NIPS.
[15] John Wright, et al. A Geometric Analysis of Phase Retrieval, 2016, International Symposium on Information Theory.
[16] Nathan Srebro, et al. Global Optimality of Local Search for Low Rank Matrix Recovery, 2016, NIPS.
[17] Yair Carmon, et al. Accelerated Methods for Non-Convex Optimization, 2016, SIAM J. Optim.
[18] Kfir Y. Levy, et al. The Power of Normalization: Faster Evasion of Saddle Points, 2016, ArXiv.
[19] Yair Carmon, et al. Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step, 2016, ArXiv.
[20] Michael I. Jordan, et al. Gradient Descent Converges to Minimizers, 2016, ArXiv.
[21] Constantine Caramanis, et al. Fast Algorithms for Robust PCA via Gradient Descent, 2016, NIPS.
[22] Michael I. Jordan, et al. Gradient Descent Only Converges to Minimizers, 2016, COLT.
[23] Tengyu Ma, et al. Matrix Completion has No Spurious Local Minimum, 2016, NIPS.
[24] Elad Hazan, et al. Finding Local Minima for Nonconvex Optimization in Linear Time, 2016.
[25] Junwei Lu, et al. Symmetry, Saddle Points, and Global Geometry of Nonconvex Matrix Factorization, 2016, ArXiv.
[26] Yi Zheng, et al. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis, 2017, ICML.
[27] Xiao Zhang, et al. Stochastic Variance-reduced Gradient Descent for Low-rank Matrix Recovery from Linear Measurements, 2017, ArXiv, 1701.00481.
[28] John Wright, et al. Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture, 2015, IEEE Transactions on Information Theory.
[29] Daniel P. Robinson, et al. A trust region algorithm with a worst-case iteration complexity of O(ϵ^{-3/2}) for nonconvex optimization, 2016, Mathematical Programming.
[30] Prateek Jain, et al. Global Convergence of Non-Convex Gradient Descent for Computing Matrix Squareroot, 2015, AISTATS.
[31] Anastasios Kyrillidis, et al. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach, 2016, AISTATS.
[32] Michael I. Jordan, et al. How to Escape Saddle Points Efficiently, 2017, ICML.
[33] Tengyu Ma, et al. Finding approximate local minima faster than gradient descent, 2016, STOC.
[34] Junwei Lu, et al. Symmetry, Saddle Points, and Global Optimization Landscape of Nonconvex Matrix Factorization, 2016, 2018 Information Theory and Applications Workshop (ITA).
[35] Yuandong Tian, et al. When is a Convolutional Filter Easy To Learn?, 2017, ICLR.