Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization

Langevin dynamics (LD) has proven to be a powerful technique for optimizing a non-convex objective: it is an efficient algorithm for finding local minima while eventually visiting a global minimum on longer time scales. LD is based on the first-order Langevin diffusion, which is reversible in time. We study two variants based on non-reversible Langevin diffusions: underdamped Langevin dynamics (ULD) and Langevin dynamics with a non-symmetric drift (NLD). Adapting the techniques of Tzen, Liang and Raginsky (2018) for LD to non-reversible diffusions, we show that for a given local minimum within an arbitrary distance from the initialization, with high probability, either the ULD trajectory ends up somewhere outside a small neighborhood of this local minimum within a recurrence time that depends on the smallest eigenvalue of the Hessian at the local minimum, or it enters this neighborhood by the recurrence time and stays there for a potentially exponentially long escape time. The ULD algorithm improves upon the recurrence time obtained for LD in Tzen, Liang and Raginsky (2018) with respect to the dependency on the smallest eigenvalue of the Hessian at the local minimum. A similar result and improvement are obtained for the NLD algorithm. We also show that the non-reversible variants can exit the basin of attraction of a local minimum faster in discrete time when the objective has two local minima separated by a saddle point, and we quantify the amount of improvement. Our analysis suggests that non-reversible Langevin algorithms are more efficient both at locating a local minimum and at exploring the state space. Our analysis is based on the quadratic approximation of the objective around a local minimum. As a by-product of our analysis, we obtain optimal mixing rates for quadratic objectives in the 2-Wasserstein distance for the two non-reversible Langevin algorithms we consider.
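
For intuition, the sketch below contrasts Euler-type updates of the three dynamics on the quadratic approximation of the objective around a local minimum. It is a minimal illustration under assumed names and parameters (a gradient oracle grad_f, step size eta, inverse temperature beta, friction gamma, and a constant antisymmetric matrix J), not the exact discretization schemes or parameter choices analyzed in the paper.

```python
import numpy as np

def ld_step(x, grad_f, eta, beta, rng):
    """One Euler-Maruyama step of the (reversible) overdamped Langevin dynamics."""
    noise = rng.standard_normal(x.shape)
    return x - eta * grad_f(x) + np.sqrt(2.0 * eta / beta) * noise

def uld_step(x, v, grad_f, eta, beta, gamma, rng):
    """One explicit Euler step of underdamped (second-order) Langevin dynamics (ULD)."""
    noise = rng.standard_normal(x.shape)
    v_new = v - eta * (gamma * v + grad_f(x)) + np.sqrt(2.0 * gamma * eta / beta) * noise
    x_new = x + eta * v_new
    return x_new, v_new

def nld_step(x, grad_f, J, eta, beta, rng):
    """One Euler-Maruyama step of Langevin dynamics with a non-symmetric drift (NLD).

    With J antisymmetric, the continuous-time diffusion with drift -(I + J) grad f
    keeps the same Gibbs stationary distribution while breaking reversibility.
    """
    noise = rng.standard_normal(x.shape)
    drift = (np.eye(x.size) + J) @ grad_f(x)
    return x - eta * drift + np.sqrt(2.0 * eta / beta) * noise

# Illustration on the quadratic approximation f(x) = 0.5 * x^T H x around a minimum at 0.
H = np.array([[2.0, 0.1], [0.1, 0.5]])
grad_f = lambda x: H @ x
J = np.array([[0.0, 1.0], [-1.0, 0.0]])  # antisymmetric perturbation for NLD
rng = np.random.default_rng(0)
x, v = np.ones(2), np.zeros(2)
for _ in range(2000):
    x, v = uld_step(x, v, grad_f, eta=1e-2, beta=10.0, gamma=1.0, rng=rng)
```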

[1] H. Eyring, The Activated Complex in Chemical Reactions, 1935.

[2] H. Kramers, Brownian motion in a field of force and the diffusion model of chemical reactions, 1940.

[3] A. N. Shiryaev, et al., Statistics of random processes, 1977.

[4] L. Rogers, et al., Diffusions, Markov processes, and martingales, 1979.

[5] C. D. Gelatt, et al., Optimization by Simulated Annealing, 1983, Science.

[6] J. Elgin, The Fokker-Planck Equation: Methods of Solution and Applications, 1984.

[7] B. Gidas, Nonstationary Markov chains and convergence of the annealing algorithm, 1985.

[8] Bruce Hajek, A tutorial survey of theory and applications of simulated annealing, 1985, 24th IEEE Conference on Decision and Control.

[9] S. Duane, et al., Hybrid Monte Carlo, 1987.

[10] B. Øksendal, Stochastic differential equations: an introduction with applications, 1987.

[11] M. Gelbrich, On a Formula for the L2 Wasserstein Metric between Measures on Euclidean and Hilbert Spaces, 1990.

[12] S. Mitter, et al., Recursive stochastic algorithms for global optimization in R^d, 1991.

[13] S. Meyn, et al., Stability of Markovian processes I: criteria for discrete-time chains, 1992, Advances in Applied Probability.

[14] T. Lindvall, Lectures on the Coupling Method, 1992.

[15] C. Hwang, et al., Accelerating Gaussian Diffusions, 1993.

[16] Hai-Tao Fang, et al., Annealing of Iterative Stochastic Schemes, 1997.

[17] V. Borkar, et al., A Strong Approximation Theorem for Stochastic Recursive Algorithms, 1999.

[18] Radford M. Neal, et al., Analysis of a Nonreversible Markov Chain Sampler, 2000.

[19] Liming Wu, Large and moderate deviations and exponential convergence for stochastic damping Hamiltonian systems, 2001.

[20] N. Berglund, et al., Beyond the Fokker-Planck equation: pathwise control of noisy bistable systems, 2001, cond-mat/0110180.

[21] Jonathan C. Mattingly, et al., Ergodicity for SDEs and approximations: locally Lipschitz vector fields and degenerate noise, 2002.

[22] N. Berglund, et al., Geometric singular perturbation theory for stochastic differential equations, 2002.

[23] F. Nier, Quantitative analysis of metastability in reversible diffusion processes via a Witten complex approach, 2004.

[24] F. Hérau, et al., Isotropic Hypoellipticity and Trend to Equilibrium for the Fokker-Planck Equation with a High-Degree Potential, 2004.

[25] Yurii Nesterov, Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.

[26] A. Bovier, et al., Metastability in Reversible Diffusion Processes I: Sharp Asymptotics for Capacities and Exit Times, 2004.

[27] C. Hwang, et al., Accelerating diffusions, 2005, math/0505245.

[28] Sylvain Maire, et al., Sequential Control Variates for Functionals of Markov Processes, 2005, SIAM J. Numer. Anal.

[29] A. Bovier, et al., Metastability in reversible diffusion processes II. Precise asymptotics for small eigenvalues, 2005.

[30] Michael L. Overton, et al., Optimizing the asymptotic convergence rate of the Diaconis-Holmes-Neal sampler, 2007, Adv. Appl. Math.

[31] C. Villani, Optimal Transport: Old and New, 2008.

[32] Devavrat Shah, et al., Gossip Algorithms, 2009, Found. Trends Netw.

[33] T. Lelièvre, et al., Free Energy Computations: A Mathematical Perspective, 2010.

[34] Radford M. Neal, MCMC Using Hamiltonian Dynamics, 2011, arXiv:1206.1901.

[35] N. Berglund, Kramers' law: Validity, derivations and generalisations, 2011, arXiv:1106.5799.

[36] S. Glotzer, et al., Time-course gait analysis of hemiparkinsonian rats following 6-hydroxydopamine lesion, 2004, Behavioural Brain Research.

[37] G. Pavliotis, et al., Optimal Non-reversible Linear Drift for the Convergence to Equilibrium of a Diffusion, 2012, arXiv:1212.0876.

[38] B. Bouchard, et al., First time to exit of a continuous Itô process: General moment estimates and L1-convergence rate for discrete time approximations, 2013, arXiv:1307.4247.

[39] Sébastien Bubeck, Theory of Convex Optimization for Machine Learning, 2014, arXiv.

[40] Tianqi Chen, et al., Stochastic Gradient Hamiltonian Monte Carlo, 2014, ICML.

[41] K. Spiliopoulos, et al., Variance reduction for irreversible Langevin samplers and diffusion on graphs, 2014, arXiv:1410.0255.

[42] A. Dalalyan, Theoretical guarantees for approximate sampling from smooth and log-concave densities, 2014, arXiv:1412.7392.

[43] C. Hwang, et al., Attaining the Optimal Gaussian Diffusion Acceleration, 2014.

[44] G. Pavliotis, Stochastic Processes and Applications: Diffusion Processes, the Fokker-Planck and Langevin Equations, 2014.

[45] F. Bouchet, et al., Generalisation of the Eyring-Kramers Transition Rate Formula to Irreversible Diffusion Processes, 2015, Annales Henri Poincaré.

[46] Hariharan Narayanan, et al., Escaping the Local Minima via Simulated Annealing: Optimization of Approximately Convex Functions, 2015, COLT.

[47] Sébastien Bubeck, Convex Optimization: Algorithms and Complexity, 2014, Found. Trends Mach. Learn.

[48] É. Moulines, et al., Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm, 2015, arXiv:1507.05021.

[49] Avraham Adler, Lambert-W Function, 2015.

[50] B. Leimkuhler, et al., The computation of averages from equilibrium and nonequilibrium Langevin molecular dynamics, 2013, arXiv:1308.5814.

[51] Lawrence Carin, et al., On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators, 2015, NIPS.

[52] Dimitri P. Bertsekas, Convex Optimization Algorithms, 2015.

[53] A. Guillin, et al., Optimal linear drift for the speed of convergence of an hypoelliptic diffusion, 2016, arXiv:1604.07295.

[54] Yann LeCun, et al., Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond, 2016, arXiv:1611.07476.

[55] C. Landim, et al., Metastability of Nonreversible Random Walks in a Potential Field and the Eyring-Kramers Transition Rate Formula, 2016, arXiv:1605.01009.

[56] G. Pavliotis, et al., Variance Reduction Using Nonreversible Langevin Samplers, 2015, Journal of Statistical Physics.

[57] Konstantinos Spiliopoulos, Improving the Convergence of Reversible Samplers, 2016.

[58] Umut Simsekli, Fractional Langevin Monte Carlo: Exploring Levy Driven Stochastic Differential Equations for Markov Chain Monte Carlo, 2017, ICML.

[59] Matus Telgarsky, et al., Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis, 2017, COLT.

[60] Oren Mangoubi, et al., Rapid Mixing of Hamiltonian Monte Carlo on Strongly Log-Concave Distributions, 2017, arXiv:1708.07114.

[61] Alexandre d'Aspremont, et al., Integration Methods and Optimization Algorithms, 2017, NIPS.

[62] Yuchen Zhang, et al., A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics, 2017, COLT.

[63] Stefano Soatto, et al., Entropy-SGD: biasing gradient descent into wide valleys, 2016, ICLR.

[64] G. Pavliotis, et al., Using Perturbed Underdamped Langevin Dynamics to Efficiently Sample from Probability Distributions, 2017, Journal of Statistical Physics.

[65] G. Pavliotis, et al., Nonreversible Langevin Samplers: Splitting Schemes, Analysis and Implementation, 2017, arXiv:1701.04247.

[66] Michael I. Jordan, et al., On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo, 2018, ICML.

[67] Michael I. Jordan, et al., Underdamped Langevin MCMC: A non-asymptotic analysis, 2017, COLT.

[68] C. Landim, et al., Dirichlet's and Thomson's Principles for Non-selfadjoint Elliptic Operators with Application to Non-reversible Metastable Diffusion Processes, 2017, Archive for Rational Mechanics and Analysis.

[69] Ohad Shamir, et al., Global Non-convex Optimization with Discretized Diffusions, 2018, NeurIPS.

[70] A. Montanari, et al., The landscape of empirical risk for nonconvex losses, 2016, The Annals of Statistics.

[71] Jinghui Chen, et al., Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization, 2017, NeurIPS.

[72] Maxim Raginsky, et al., Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability, 2018, COLT.

[73] Michael I. Jordan, et al., Sharp Convergence Rates for Langevin Dynamics in the Nonconvex Setting, 2018, arXiv.

[74] Mert Gürbüzbalaban, et al., Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration, 2018, Oper. Res.

[75] Arnak S. Dalalyan, et al., On sampling from a log-concave density using kinetic Langevin diffusions, 2018, Bernoulli.

[76] Michael I. Jordan, et al., Is There an Analog of Nesterov Acceleration for MCMC?, 2019, arXiv.

[77] A. Eberle, et al., Couplings and quantitative contraction rates for Langevin dynamics, 2017, The Annals of Probability.

[78] Gaël Richard, et al., Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization, 2019, ICML.

[79] Xi Chen, et al., On Stationary-Point Hitting Time and Ergodicity of Stochastic Gradient Langevin Dynamics, 2019, J. Mach. Learn. Res.