On the geometry of Stein variational gradient descent

Bayesian inference problems require sampling from or approximating high-dimensional probability distributions. This paper focuses on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems whose mean-field limit is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to consider certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these kernels in various numerical experiments.
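
To make the interacting particle system concrete, the following is a minimal one-dimensional sketch of the standard Stein variational gradient descent update. Each particle moves along a kernel-weighted average of the target's score plus a repulsive term given by the kernel gradient. The Gaussian kernel, the bandwidth h, the step size, and all function names are illustrative assumptions and are not taken from the paper. In particular, this sketch does not implement the nondifferentiable kernels with adjusted tails that the analysis leads to. Trying a different kernel means replacing both rbf and rbf_grad_x.

```python
import math


def rbf(x, y, h):
    """Gaussian kernel k(x, y) = exp(-(x - y)^2 / (2 h^2)) (assumed choice)."""
    return math.exp(-((x - y) ** 2) / (2.0 * h * h))


def rbf_grad_x(x, y, h):
    """Derivative of k(x, y) with respect to its first argument."""
    return -(x - y) / (h * h) * rbf(x, y, h)


def svgd_step(particles, grad_log_target, step, h):
    """One SVGD update for 1D particles.

    Particle x_i moves by step * phi(x_i), where
    phi(x_i) = (1/n) * sum_j [k(x_j, x_i) * score(x_j) + d/dx_j k(x_j, x_i)].
    """
    n = len(particles)
    scores = [grad_log_target(x) for x in particles]
    updated = []
    for xi in particles:
        # Attraction: kernel-weighted average of the score of the target.
        drive = sum(rbf(xj, xi, h) * s for xj, s in zip(particles, scores))
        # Repulsion: kernel gradient pushes x_i away from nearby particles.
        repulse = sum(rbf_grad_x(xj, xi, h) for xj in particles)
        updated.append(xi + step * (drive + repulse) / n)
    return updated


def run_svgd(particles, grad_log_target, steps=500, step=0.1, h=1.0):
    """Iterate svgd_step a fixed number of times (hypothetical defaults)."""
    for _ in range(steps):
        particles = svgd_step(particles, grad_log_target, step, h)
    return particles


if __name__ == "__main__":
    # Target: standard normal, whose score is d/dx log pi(x) = -x.
    start = [3.0 + 0.1 * i for i in range(20)]
    final = run_svgd(start, lambda x: -x)
    print("mean:", sum(final) / len(final))
    print("min, max:", min(final), max(final))
```

With a flat target (score identically zero) only the repulsive term remains and the particles spread apart. With a single particle the repulsion vanishes and the update reduces to plain gradient ascent on the log-density.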
