Sampling can be faster than optimization

Significance

Modern large-scale data analysis and machine learning applications rely critically on computationally efficient algorithms. There are two main classes of algorithms used in this setting: those based on optimization and those based on Monte Carlo sampling. The folk wisdom is that sampling is necessarily slower than optimization and is only warranted in situations where estimates of uncertainty are needed. We show that this folk wisdom is not correct in general: there is a natural class of nonconvex problems for which the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.

Optimization algorithms and Monte Carlo sampling algorithms have provided the computational foundations for the rapid growth in applications of statistical machine learning in recent years. There is, however, limited theoretical understanding of the relationships between these two kinds of methodology, and limited understanding of their relative strengths and weaknesses. Moreover, existing results have been obtained primarily in the setting of convex functions (for optimization) and log-concave functions (for sampling). In this setting, where local properties determine global properties, optimization algorithms are unsurprisingly more efficient computationally than sampling algorithms. We instead examine a class of nonconvex objective functions that arise in mixture modeling and multistable systems. In this nonconvex setting, we find that the computational complexity of sampling algorithms scales linearly with the model dimension while that of optimization algorithms scales exponentially.
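To make the two classes of algorithms concrete, the following is a minimal sketch, not the construction analyzed in the paper. It assumes a symmetric two-component Gaussian mixture as the nonconvex objective, plain gradient descent as the optimizer, and the unadjusted Langevin algorithm as the sampler. The text above names none of these choices, and every function name, step size, and dimension here is hypothetical. The sketch only shows the qualitative difference between returning one point and drawing samples across modes; it does not reproduce the linear-versus-exponential complexity result.

```python
import numpy as np

def grad_U(x, a):
    # Potential U(x) = |x|^2 / 2 - log cosh(a . x) + const, the negative
    # log-density of the mixture 0.5 * N(a, I) + 0.5 * N(-a, I).
    # Its gradient is x - a * tanh(a . x). (Assumed toy objective.)
    return x - a * np.tanh(a @ x)

def gradient_descent(x0, a, step, n_steps):
    # Optimization: follow the negative gradient to a stationary point.
    x = x0.copy()
    for _ in range(n_steps):
        x = x - step * grad_U(x, a)
    return x

def unadjusted_langevin(x0, a, step, n_steps, rng):
    # Sampling: the same gradient step plus Gaussian noise of scale
    # sqrt(2 * step), with no Metropolis accept/reject correction.
    x = x0.copy()
    samples = np.empty((n_steps, x.size))
    for k in range(n_steps):
        noise = rng.standard_normal(x.size)
        x = x - step * grad_U(x, a) + np.sqrt(2.0 * step) * noise
        samples[k] = x
    return samples

d = 5                      # hypothetical model dimension
a = np.zeros(d)
a[0] = 2.0                 # modes sit near +a and -a
x0 = np.full(d, 0.5)       # shared starting point
rng = np.random.default_rng(0)

x_gd = gradient_descent(x0, a, step=0.1, n_steps=2000)
samples = unadjusted_langevin(x0, a, step=0.05, n_steps=20000, rng=rng)

# Fraction of samples on the +a side of the separating hyperplane.
frac_right = np.mean(samples @ a > 0)
```

In this toy setup gradient descent settles into the single mode on its side of the barrier, while the Langevin iterates cross between both modes, so `frac_right` lies strictly between 0 and 1.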
