Escaping Saddle-Points Faster under Interpolation-like Conditions

In this paper, we show that, under over-parametrization, several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. A fundamental property of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in the over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under the same interpolation-like conditions and show that its oracle complexity to reach an $\epsilon$-local-minimizer is $\tilde{\mathcal{O}}(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the $\tilde{\mathcal{O}}(1/\epsilon^{1.5})$ rate of the deterministic Cubic-Regularized Newton method; it appears that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order setting.
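For concreteness, one commonly studied interpolation-like condition on the stochastic gradients is the strong growth condition; the abstract does not spell out the exact assumption used here, so the following should be read as an illustrative instance rather than the paper's precise condition. For a finite-sum objective $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, the strong growth condition asks for a constant $\rho \ge 1$ such that $\mathbb{E}_{i}\big[\|\nabla f_i(x)\|^{2}\big] \le \rho\,\|\nabla f(x)\|^{2}$ for all $x$. In particular, the stochastic-gradient noise vanishes at stationary points of $f$, consistent with an over-parametrized model that interpolates the data, since every $f_i$ is simultaneously minimized at the interpolating solution.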

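To make the first-order setting concrete, below is a minimal Python/NumPy sketch of a perturbed stochastic gradient method in the spirit of PSGD: plain SGD steps, with an occasional random perturbation injected when the observed stochastic gradient is small, so the iterate can move away from strict saddle points. The function names, step size, perturbation radius, and schedule are illustrative placeholders, not the tuned choices analyzed in the paper.

import numpy as np

def perturbed_sgd(grad_fn, x0, step_size=1e-2, radius=1e-2, grad_tol=1e-3,
                  perturb_interval=100, n_iters=10000, rng=None):
    # grad_fn(x): returns a stochastic gradient of the objective at x.
    # When the stochastic gradient looks small and enough steps have passed
    # since the last perturbation, add a random perturbation of norm at most
    # `radius`; otherwise take a plain SGD step.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -perturb_interval
    for t in range(n_iters):
        g = grad_fn(x)
        if np.linalg.norm(g) <= grad_tol and t - last_perturb >= perturb_interval:
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
            x = x + radius * rng.uniform() * direction  # escape step
            last_perturb = t
        else:
            x = x - step_size * g                       # SGD step
    return x

# Toy usage: f(x) = 0.25*x1^4 - 0.5*x1^2 + 0.5*x2^2 has a strict saddle at the
# origin and minimizers at (+/-1, 0).
def noisy_grad(x, _rng=np.random.default_rng(0)):
    g = np.array([x[0]**3 - x[0], x[1]])       # exact gradient of the toy f
    return g * (1.0 + 0.1 * _rng.normal())     # multiplicative gradient noise

x_final = perturbed_sgd(noisy_grad, x0=np.zeros(2))

In the toy usage, the stochastic gradient carries multiplicative noise that shrinks with the true gradient, loosely mimicking the interpolation regime in which gradient noise vanishes near minimizers.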