Escaping Saddle-Points Faster under Interpolation-like Conditions

In this paper, we show that, under over-parametrization, several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. A fundamental property of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in the over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $\tilde{\mathcal{O}}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under the same interpolation-like conditions and show that its oracle complexity to reach an $\epsilon$-local-minimizer is $\tilde{\mathcal{O}}(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the $\tilde{\mathcal{O}}(1/\epsilon^{1.5})$ rate of the deterministic Cubic-Regularized Newton method; it appears that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order setting.
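For concreteness, one commonly studied interpolation-like condition on the stochastic gradients is the strong growth condition; the abstract does not spell out the exact assumption used here, so the following should be read as an illustrative instance rather than the paper's precise condition. For a finite-sum objective $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, the strong growth condition asks for a constant $\rho \ge 1$ such that $\mathbb{E}_{i}\big[\|\nabla f_i(x)\|^{2}\big] \le \rho\,\|\nabla f(x)\|^{2}$ for all $x$. In particular, the stochastic-gradient noise vanishes at stationary points of $f$, consistent with an over-parametrized model that interpolates the data, since every $f_i$ is simultaneously minimized at the interpolating solution.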

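To make the first-order setting concrete, below is a minimal Python/NumPy sketch of a perturbed stochastic gradient method in the spirit of PSGD: plain SGD steps, with an occasional random perturbation injected when the observed stochastic gradient is small, so the iterate can move away from strict saddle points. The function names, step size, perturbation radius, and schedule are illustrative placeholders, not the tuned choices analyzed in the paper.

import numpy as np

def perturbed_sgd(grad_fn, x0, step_size=1e-2, radius=1e-2, grad_tol=1e-3,
                  perturb_interval=100, n_iters=10000, rng=None):
    # grad_fn(x): returns a stochastic gradient of the objective at x.
    # When the stochastic gradient looks small and enough steps have passed
    # since the last perturbation, add a random perturbation of norm at most
    # `radius`; otherwise take a plain SGD step.
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    last_perturb = -perturb_interval
    for t in range(n_iters):
        g = grad_fn(x)
        if np.linalg.norm(g) <= grad_tol and t - last_perturb >= perturb_interval:
            direction = rng.normal(size=x.shape)
            direction /= np.linalg.norm(direction)
            x = x + radius * rng.uniform() * direction  # escape step
            last_perturb = t
        else:
            x = x - step_size * g                       # SGD step
    return x

# Toy usage: f(x) = 0.25*x1^4 - 0.5*x1^2 + 0.5*x2^2 has a strict saddle at the
# origin and minimizers at (+/-1, 0).
def noisy_grad(x, _rng=np.random.default_rng(0)):
    g = np.array([x[0]**3 - x[0], x[1]])       # exact gradient of the toy f
    return g * (1.0 + 0.1 * _rng.normal())     # multiplicative gradient noise

x_final = perturbed_sgd(noisy_grad, x0=np.zeros(2))

In the toy usage, the stochastic gradient carries multiplicative noise that shrinks with the true gradient, loosely mimicking the interpolation regime in which gradient noise vanishes near minimizers.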