Adaptive Stochastic Gradient Langevin Dynamics: Taming Convergence and Saddle Point Escape Time

In this paper, we propose a new adaptive stochastic gradient Langevin dynamics (ASGLD) algorithmic framework and its two specialized versions, namely adaptive stochastic gradient (ASG) and adaptive gradient Langevin dynamics (AGLD), for non-convex optimization problems. All proposed algorithms can escape from saddle points in at most $O(\log d)$ iterations, which is nearly dimension-free. Further, we show that ASGLD and ASG converge to a local minimum in at most $O(\log d/\epsilon^4)$ iterations. Moreover, ASGLD with full gradients, or ASGLD with a slowly (linearly) increasing batch size, converges to a local minimum within $O(\log d/\epsilon^2)$ iterations, which outperforms existing first-order methods.
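To make the flavor of the framework concrete, below is a minimal sketch of an adaptive stochastic gradient Langevin loop that pairs a noisy gradient step with an AdaGrad-style elementwise preconditioner. The abstract does not specify the paper's exact ASGLD update rule, step-size schedule, or noise scaling, so the function name `asgld`, the preconditioner, and all constants here are illustrative assumptions rather than the authors' method.

```python
import numpy as np

def asgld(grad_fn, x0, n_iters=1000, eta=1e-2, temperature=1e-2, eps=1e-8, rng=None):
    """Generic adaptive SGLD loop (illustrative sketch, not the paper's exact update).

    grad_fn(x) returns a (possibly stochastic) gradient estimate at x.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    accum = np.zeros_like(x)  # running sum of squared gradients (AdaGrad-style, assumed)
    for _ in range(n_iters):
        g = grad_fn(x)
        accum += g * g
        precond = 1.0 / (np.sqrt(accum) + eps)  # elementwise adaptive step sizes
        noise = rng.normal(size=x.shape)
        # Langevin step: preconditioned gradient descent plus injected Gaussian noise,
        # with the noise scale tied to the step size and a temperature parameter.
        x = x - eta * precond * g + np.sqrt(2.0 * eta * temperature * precond) * noise
    return x

if __name__ == "__main__":
    # Toy non-convex objective f(x) = sum(x_i^4 - x_i^2) with noisy gradients.
    rng = np.random.default_rng(0)
    noisy_grad = lambda x: 4 * x**3 - 2 * x + 0.1 * rng.normal(size=x.shape)
    print(asgld(noisy_grad, x0=np.ones(10), rng=rng))
```

The injected Gaussian noise is what gives Langevin-type methods their ability to leave saddle points, which is the behavior the iteration bounds above quantify; the adaptive preconditioner shown here is only one plausible instantiation of the "adaptive" component.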
