Accelerated Information Gradient Flow

We present a framework for Nesterov's accelerated gradient flows in probability space. Four examples of information metrics are considered: the Fisher-Rao metric, the Wasserstein-2 metric, the Kalman-Wasserstein metric, and the Stein metric. For the Fisher-Rao and Wasserstein-2 metrics, we prove convergence properties of the accelerated gradient flows. For implementation, we propose a sampling-efficient discrete-time algorithm, with a restart technique, for the Wasserstein-2, Kalman-Wasserstein, and Stein accelerated gradient flows. We also formulate a kernel bandwidth selection method that learns the gradient of the logarithm of the density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and Bayesian neural networks, demonstrate the strength of the proposed methods compared with state-of-the-art algorithms.
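To make the construction concrete, here is a hedged sketch of the accelerated flow under the Wasserstein-2 metric, written as a damped Hamiltonian system in the density \rho_t and a momentum potential \Phi_t. This form is consistent with the Wasserstein Hamiltonian flow literature; the paper's precise equations may differ:

    \partial_t \rho_t + \nabla \cdot (\rho_t \nabla \Phi_t) = 0,
    \partial_t \Phi_t + \alpha_t \Phi_t + \tfrac{1}{2} |\nabla \Phi_t|^2 + \frac{\delta F}{\delta \rho}(\rho_t) = 0,

where F is the objective energy (e.g., the KL divergence to the target distribution) and \alpha_t is a damping coefficient; \alpha_t = 0 gives an undamped Hamiltonian flow, while strong damping formally recovers the standard Wasserstein gradient flow.

The following Python sketch illustrates a particle discretization of such a flow for sampling from \pi(x) \propto \exp(-f(x)), with momentum and a gradient-based restart heuristic in the spirit of adaptive restart for Nesterov schemes. All names (aig_sample, grad_log_density) and parameter values are illustrative assumptions, not the authors' implementation; the plain kernel estimate of \nabla \log \rho stands in for the paper's bandwidth-selection method.

    # A minimal, hypothetical sketch of a discrete-time accelerated
    # Wasserstein-2 gradient flow for sampling from pi(x) ~ exp(-f(x)).
    # Function names and step sizes are illustrative assumptions.
    import numpy as np

    def grad_log_density(X, h):
        """Kernel estimate of grad log rho at each particle, using a
        Gaussian kernel K(z) = exp(-|z|^2 / (2h))."""
        diff = X[:, None, :] - X[None, :, :]             # (N, N, d) pairwise differences
        K = np.exp(-np.sum(diff**2, axis=-1) / (2 * h))  # (N, N) kernel matrix
        # grad(rho_hat) / rho_hat: kernel gradients normalized by the density.
        num = -(K[:, :, None] * diff / h).sum(axis=1)    # (N, d)
        return num / K.sum(axis=1, keepdims=True)

    def aig_sample(grad_f, X0, steps=500, tau=0.05, alpha=0.9, h=0.1):
        """Accelerated particle update with momentum and a restart rule."""
        X, V = X0.copy(), np.zeros_like(X0)
        for _ in range(steps):
            # -grad f - grad log rho is the Wasserstein gradient of KL(rho || pi).
            force = -grad_f(X) - grad_log_density(X, h)
            V = alpha * V + tau * force
            # Restart heuristic: zero the momentum of particles moving against
            # the force (a surrogate for function-value restart).
            against = np.sum(V * force, axis=1) < 0
            V[against] = 0.0
            X = X + tau * V
        return X

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X0 = rng.normal(size=(200, 2)) + 3.0   # particles initialized far from the target
        X = aig_sample(lambda X: X, X0)        # standard Gaussian target: f(x) = |x|^2 / 2
        print(X.mean(axis=0), X.var(axis=0))   # mean near 0, variance near 1

On the standard Gaussian target above, the particle mean and variance approach those of \pi as the flow equilibrates; the restart step only discards momentum and so cannot destabilize the plain (unaccelerated) update.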
