On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics

Stochastic gradient Langevin dynamics (SGLD) is a fundamental algorithm in stochastic optimization. Recent work by Zhang et al. [2017] presents an analysis of the hitting time of SGLD to first- and second-order stationary points. The proof in Zhang et al. [2017] is a two-stage procedure that bounds the Cheeger constant, which is rather complicated and leads to loose bounds. In this paper, using intuitions from stochastic differential equations, we provide a direct analysis of the hitting times of SGLD to first- and second-order stationary points. Our analysis is straightforward and relies only on basic tools from linear algebra and probability theory. It also yields tighter bounds than those of Zhang et al. [2017] and makes explicit the dependence of the hitting time on different factors, including dimensionality, smoothness, noise strength, and step size. Under suitable conditions, we show that the hitting time of SGLD to first-order stationary points can be dimension-independent. Moreover, we apply our analysis to several important online estimation problems in machine learning, including linear regression, matrix factorization, and online PCA.
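For concreteness, the sketch below shows the standard SGLD update studied in this line of work, x_{k+1} = x_k - eta * g(x_k) + sqrt(2 * eta / beta) * xi_k with xi_k ~ N(0, I), applied to a toy online linear regression objective. This is only a minimal illustration of the iteration: the function names, step size, inverse temperature beta, and the regression example are illustrative choices, not the exact setup or constants analyzed in the paper.

```python
import numpy as np

def sgld(stochastic_grad, x0, step_size=1e-2, beta=100.0, n_steps=5000, rng=None):
    """Plain SGLD: x <- x - eta * g(x) + sqrt(2 * eta / beta) * N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        g = stochastic_grad(x, rng)                # unbiased gradient estimate
        xi = rng.standard_normal(x.shape)          # isotropic Gaussian injection
        x = x - step_size * g + np.sqrt(2.0 * step_size / beta) * xi
    return x

# Illustrative stochastic gradient for online linear regression,
# f(x) = E[(a^T x - b)^2] / 2, with one fresh sample (a, b) drawn per step.
def make_linear_regression_grad(w_true, noise_std=0.1):
    def grad(x, rng):
        a = rng.standard_normal(x.shape)
        b = a @ w_true + noise_std * rng.standard_normal()
        return (a @ x - b) * a
    return grad

w_true = np.array([1.0, -2.0, 0.5])
x_hat = sgld(make_linear_regression_grad(w_true), x0=np.zeros(3))
print(x_hat)  # lands near w_true, a first-order stationary point of the population objective
```

The hitting-time question is how many such iterations are needed before the iterate first reaches an (approximate) first- or second-order stationary point; the injected Gaussian noise is what allows the iterate to escape saddle points.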

[1] Wei Hu, et al. Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced, 2018, NeurIPS.

[2] Yaodong Yu, et al. Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima, 2018, NeurIPS.

[3] Yuchen Zhang, et al. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics, 2017, COLT.

[4] Michael I. Jordan, et al. Gradient Descent Can Take Exponential Time to Escape Saddle Points, 2017, NIPS.

[5] Sungjin Ahn, et al. Bayesian Posterior Sampling via Stochastic Gradient Fisher Scoring, 2012, ICML.

[6] Zeyuan Allen-Zhu, et al. Natasha 2: Faster Non-Convex Optimization Than SGD, 2017, NeurIPS.

[7] Kfir Y. Levy, et al. The Power of Normalization: Faster Evasion of Saddle Points, 2016, arXiv.

[8] Matus Telgarsky, et al. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis, 2017, COLT.

[9] Michael I. Jordan, et al. On Nonconvex Optimization for Machine Learning, 2019, J. ACM.

[10] Arnak S. Dalalyan, et al. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient, 2017, Stochastic Processes and their Applications.

[11] Quoc V. Le, et al. Adding Gradient Noise Improves Learning for Very Deep Networks, 2015, arXiv.

[12] Thomas Hofmann, et al. Escaping Saddles with Stochastic Gradients, 2018, ICML.

[13] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[14] Hermann Ney, et al. A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition, 2017, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[15] Eric Moulines, et al. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo, 2017, COLT.

[16] Arnak S. Dalalyan, et al. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, 2017, COLT.

[17] Lawrence Carin, et al. On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators, 2015, NIPS.

[18] Jonathan C. Mattingly, et al. Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise, 2002.

[19] Richard L. Tweedie, et al. Markov Chains and Stochastic Stability, 1993, Communications and Control Engineering Series.

[20] Michael I. Jordan, et al. How to Escape Saddle Points Efficiently, 2017, ICML.

[21] Kenji Kawaguchi, et al. Deep Learning without Poor Local Minima, 2016, NIPS.

[22] Daniel P. Robinson, et al. A trust region algorithm with a worst-case iteration complexity of O(ϵ^{-3/2}) for nonconvex optimization, 2016, Mathematical Programming.

[23] Nathan Srebro, et al. Global Optimality of Local Search for Low Rank Matrix Recovery, 2016, NIPS.

[24] Zeyuan Allen-Zhu, et al. Variance Reduction for Faster Non-Convex Optimization, 2016, ICML.

[25] Tengyu Ma, et al. Learning One-hidden-layer Neural Networks with Landscape Design, 2017, ICLR.

[26] A. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities, 2014, arXiv:1412.7392.

[27] Kai Zheng, et al. Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints, 2017, COLT.

[28] Tianbao Yang, et al. First-order Stochastic Algorithms for Escaping From Saddle Points in Almost Linear Time, 2017, NeurIPS.

[29] Jinghui Chen, et al. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization, 2017, NeurIPS.

[30] Yurii Nesterov, et al. Cubic regularization of Newton method and its global performance, 2006, Math. Program.

[31] Suvrit Sra, et al. A Critical View of Global Optimality in Deep Learning, 2018, arXiv.

[32] Yuanzhi Li, et al. Neon2: Finding Local Minima via First-Order Oracles, 2017, NeurIPS.

[33] Yee Whye Teh, et al. Bayesian Learning via Stochastic Gradient Langevin Dynamics, 2011, ICML.

[34] Tengyu Ma, et al. Finding approximate local minima faster than gradient descent, 2016, STOC.

[35] Yi Zheng, et al. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis, 2017, ICML.

[36] Jing Dong, et al. Replica Exchange for Non-Convex Optimization, 2020, J. Mach. Learn. Res.

[37] Michael I. Jordan, et al. Gradient Descent Only Converges to Minimizers, 2016, COLT.

[38] Andrea Montanari, et al. Solving SDPs for synchronization and MaxCut problems via the Grothendieck inequality, 2017, COLT.

[39] Tianqi Chen, et al. A Complete Recipe for Stochastic Gradient MCMC, 2015, NIPS.

[40] Lukasz Kaiser, et al. Neural GPUs Learn Algorithms, 2015, ICLR.

[41] É. Moulines, et al. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm, 2015, arXiv:1507.05021.

[42] Alexander J. Smola, et al. Variance Reduction in Stochastic Gradient Langevin Dynamics, 2016, NIPS.

[43] Alexander J. Smola, et al. Stochastic Variance Reduction for Nonconvex Optimization, 2016, ICML.

[44] John Wright, et al. Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture, 2015, IEEE Transactions on Information Theory.

[45] Furong Huang, et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition, 2015, COLT.

[46] Maxim Raginsky, et al. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability, 2018, COLT.

[47] Anastasios Kyrillidis, et al. Non-square matrix sensing without spurious local minima via the Burer-Monteiro approach, 2016, AISTATS.

[48] Yair Carmon, et al. Accelerated Methods for Non-Convex Optimization, 2018, SIAM J. Optim.

[49] Zhouchen Lin, et al. Sharp Analysis for Nonconvex SGD Escaping from Saddle Points, 2019, COLT.

[50] Jason D. Lee, et al. On the Power of Over-parametrization in Neural Networks with Quadratic Activation, 2018, ICML.

[51] Andrew M. Stuart, et al. Inverse problems: A Bayesian perspective, 2010, Acta Numerica.

[52] Marcin Andrychowicz, et al. Neural Random Access Machines, 2015, ERCIM News.

[53] Quoc V. Le, et al. Neural Programmer: Inducing Latent Programs with Gradient Descent, 2015, ICLR.

[54] Michael I. Jordan, et al. Accelerated Gradient Descent Escapes Saddle Points Faster than Gradient Descent, 2017, COLT.

[55] Yair Carmon, et al. Gradient Descent Efficiently Finds the Cubic-Regularized Non-Convex Newton Step, 2016, arXiv.

[56] Saeed Ghadimi, et al. Accelerated gradient methods for nonconvex nonlinear and stochastic programming, 2013, Mathematical Programming.

[57] Yuandong Tian, et al. Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima, 2017, ICML.

[58] Sébastien Bubeck, et al. Sampling from a Log-Concave Distribution with Projected Langevin Monte Carlo, 2015, Discrete & Computational Geometry.