First-order methods almost always avoid saddle points: The case of vanishing step-sizes
Michael I. Jordan | Max Simchowitz | Benjamin Recht | Georgios Piliouras | Jason D. Lee | Ioannis Panageas