Black-Box Variational Inference as Distilled Langevin Dynamics

Variational inference (VI) and Markov chain Monte Carlo (MCMC) are approximate posterior inference algorithms that are often said to have complementary strengths, with VI being fast but biased and MCMC being slower but asymptotically unbiased. In this paper, we analyze gradient-based MCMC and VI procedures and find theoretical and empirical evidence that these procedures are not as different as one might think. In particular, a close examination of the Fokker-Planck equation that governs the Langevin dynamics (LD) MCMC procedure reveals that LD implicitly follows a gradient flow that corresponds to a variational inference procedure based on optimizing a nonparametric normalizing flow. This result suggests that the transient bias of LD (due to the Markov chain not having burned in) may track that of VI (due to the optimizer not having converged), up to differences due to VI’s asymptotic bias and parameterization. Empirically, we find that the transient biases of these algorithms (and their momentum-accelerated counterparts) do evolve similarly. This suggests that practitioners with a limited time budget may get more accurate results by running an MCMC procedure (even if it’s far from burned in) than a VI procedure, as long as the variance of the MCMC estimator can be dealt with (e.g., by running many parallel chains).
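
To make the "many parallel chains" idea concrete, the following is a minimal sketch of unadjusted Langevin dynamics run as an ensemble, not the paper's experimental setup. The one-dimensional Gaussian target, the step size, the chain count, and the names `ula_parallel`, `MU`, and `SIGMA` are all assumptions chosen for illustration. For this toy target the expected ensemble mean obeys a deterministic recursion, which plays the role of the gradient flow that the chains follow on average, so the transient bias can be read off directly.

```python
import numpy as np


def ula_parallel(grad_log_p, x0, step_size, num_steps, rng):
    """Run unadjusted Langevin dynamics on many chains at once.

    Returns the final chain states and the ensemble mean after every step.
    """
    x = np.array(x0, dtype=float)
    means = np.empty(num_steps + 1)
    means[0] = x.mean()
    for k in range(num_steps):
        noise = rng.standard_normal(x.shape)
        # Euler-Maruyama step of dx = grad log p(x) dt + sqrt(2) dW.
        x = x + step_size * grad_log_p(x) + np.sqrt(2.0 * step_size) * noise
        means[k + 1] = x.mean()
    return x, means


# Assumed toy target: N(MU, SIGMA^2). Not taken from the paper.
MU, SIGMA = 3.0, 2.0


def grad_log_p(x):
    return -(x - MU) / SIGMA**2


STEP_SIZE, NUM_STEPS, NUM_CHAINS = 0.1, 200, 20000
rng = np.random.default_rng(0)

# All chains start at 0, far from the target mean, so early estimates are biased.
final_states, means = ula_parallel(
    grad_log_p, np.zeros(NUM_CHAINS), STEP_SIZE, NUM_STEPS, rng
)

# Transient bias of the ensemble estimate of the posterior mean.
bias = np.abs(means - MU)

# For a Gaussian target the expected mean follows a deterministic recursion:
# m_{k+1} = m_k - STEP_SIZE * (m_k - MU) / SIGMA^2, with m_0 = 0.
steps = np.arange(NUM_STEPS + 1)
flow_means = MU * (1.0 - (1.0 - STEP_SIZE / SIGMA**2) ** steps)

print("bias at steps 0, 50, 200:", bias[0], bias[50], bias[200])
print("max gap to deterministic flow:", np.max(np.abs(means - flow_means)))
```

Averaging over many chains keeps the Monte Carlo noise small, so what remains in the early iterations is mostly the transient bias, and that bias shrinks along the same path as the deterministic recursion.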
