The Wasserstein Proximal Gradient Algorithm

Wasserstein gradient flows are continuous-time dynamics that define curves of steepest descent to minimize an objective function over the space of probability measures (i.e., the Wasserstein space). This objective is typically a divergence with respect to a fixed target distribution. In recent years, these continuous-time dynamics have been used to study the convergence of machine learning algorithms that aim to approximate a probability distribution. However, the discrete-time behavior of these algorithms may differ from the continuous-time dynamics. Moreover, although discretized gradient flows have been proposed in the literature, little is known about their minimization power. In this work, we propose a Forward-Backward (FB) discretization scheme that can handle the case where the objective function is the sum of a smooth and a nonsmooth geodesically convex term. Using techniques from convex optimization and optimal transport, we analyze the FB scheme as a minimization algorithm on the Wasserstein space. More precisely, we show under mild assumptions that the FB scheme has convergence guarantees similar to those of the proximal gradient algorithm in Euclidean spaces, and to those of the associated Wasserstein gradient flow.
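For illustration only, here is a minimal sketch of one forward-backward iteration on the Wasserstein space, under the simplifying assumption that the smooth term is a potential energy $\mathcal{E}(\mu) = \int V \, d\mu$ with $\nabla V$ Lipschitz, and with $\mathcal{H}$ denoting the nonsmooth geodesically convex term; the notation ($\gamma$, $\mu_k$, the JKO operator) is assumed here and not taken verbatim from the paper.

\[
\mu_{k+1/2} = (\mathrm{Id} - \gamma \nabla V)_{\#}\, \mu_k
\qquad \text{(forward: explicit gradient step on the smooth term)}
\]
\[
\mu_{k+1} = \operatorname*{arg\,min}_{\nu \in \mathcal{P}_2(\mathbb{R}^d)} \left\{ \mathcal{H}(\nu) + \frac{1}{2\gamma}\, W_2^2\big(\nu, \mu_{k+1/2}\big) \right\}
\qquad \text{(backward: JKO-type proximal step on the nonsmooth term)}
\]

This mirrors the Euclidean proximal gradient update $x_{k+1} = \mathrm{prox}_{\gamma h}\big(x_k - \gamma \nabla f(x_k)\big)$: the pushforward by $\mathrm{Id} - \gamma \nabla V$ plays the role of the explicit gradient step, and the Wasserstein proximal (JKO) operator plays the role of the prox.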
