KALE Flow: A Relaxed KL Gradient Flow for Probabilities with Disjoint Support

We study the gradient flow for a relaxed approximation to the Kullback-Leibler (KL) divergence between a moving source and a fixed target distribution. This approximation, termed the KALE (KL Approximate Lower bound Estimator), solves a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions. When using a Reproducing Kernel Hilbert Space (RKHS) to define the function class, we show that the KALE continuously interpolates between the KL and the Maximum Mean Discrepancy (MMD). Like the MMD and other Integral Probability Metrics, the KALE remains well-defined for mutually singular distributions. Nonetheless, the KALE inherits from the limiting KL a greater sensitivity to mismatch in the support of the distributions, compared with the MMD. These two properties make the KALE gradient flow particularly well suited when the target distribution is supported on a low-dimensional manifold. Under an assumption of sufficient smoothness of the trajectories, we show the global convergence of the KALE flow. We propose a particle implementation of the flow given initial samples from the source and the target distribution, which we use to empirically confirm the KALE’s properties.
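For concreteness, the relaxed objective described above can be written as a regularized form of the Fenchel dual of the KL over an RKHS H. The display below is an illustrative reconstruction from that description (λ is the regularization strength and h the dual witness); the exact constants and scaling should be taken from the paper itself.

\mathrm{KALE}(P \,\|\, Q) \;=\; (1+\lambda)\,\max_{h \in \mathcal{H}} \;\Big\{ \int h \, \mathrm{d}P \;-\; \int \big(e^{h} - 1\big)\, \mathrm{d}Q \;-\; \frac{\lambda}{2}\,\|h\|_{\mathcal{H}}^{2} \Big\}

Informally, letting λ → 0 with a sufficiently rich H recovers the KL, while λ → ∞ yields the squared-MMD endpoint, which is the interpolation property stated in the abstract.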

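The particle implementation mentioned in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the witness is parameterized as a kernel expansion over the current particles and the target samples, fitted by gradient ascent on the empirical regularized dual, and the particles then take an explicit Euler step along the negative witness gradient. The Gaussian kernel choice, all function names, step sizes, and the inner solver are assumptions; the (1+λ) scaling is folded into the step size.

import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_kale_witness(X, Y, lam=0.1, sigma=1.0, lr=0.1, iters=200):
    """Gradient ascent on the empirical regularized dual
        (1/n) sum_i h(x_i) - (1/m) sum_j (exp(h(y_j)) - 1) - (lam/2) ||h||_H^2
    with the witness parameterized as h(.) = sum_k beta_k k(z_k, .)."""
    Z = np.vstack([X, Y])                # expansion points for the witness
    Kzx = gaussian_kernel(Z, X, sigma)   # (n+m, n)
    Kzy = gaussian_kernel(Z, Y, sigma)   # (n+m, m)
    Kzz = gaussian_kernel(Z, Z, sigma)   # (n+m, n+m)
    n, m = X.shape[0], Y.shape[0]
    beta = np.zeros(Z.shape[0])
    for _ in range(iters):
        hY = Kzy.T @ beta                # witness evaluated at target samples
        grad = Kzx.sum(axis=1) / n - Kzy @ np.exp(hY) / m - lam * (Kzz @ beta)
        beta += lr * grad
    return Z, beta

def kale_particle_step(X, Y, step=0.5, lam=0.1, sigma=1.0):
    """One explicit Euler step of the empirical flow: each source particle
    moves against the gradient of the fitted witness."""
    Z, beta = fit_kale_witness(X, Y, lam=lam, sigma=sigma)
    Kzx = gaussian_kernel(Z, X, sigma)               # (n+m, n)
    # grad_x h(x_i) = sum_k beta_k k(z_k, x_i) (z_k - x_i) / sigma^2
    diffs = Z[:, None, :] - X[None, :, :]            # (n+m, n, d)
    grad_h = np.einsum('k,kn,knd->nd', beta, Kzx, diffs) / sigma**2
    return X - step * grad_h

# Toy usage: source and target Gaussians with essentially disjoint supports.
rng = np.random.default_rng(0)
X = rng.normal(loc=-4.0, scale=0.5, size=(100, 2))   # source particles
Y = rng.normal(loc=+4.0, scale=0.5, size=(100, 2))   # target samples
for _ in range(50):
    X = kale_particle_step(X, Y)

Refitting the witness at every step mirrors the two-level structure of the flow (an inner dual problem, an outer particle update); in practice the inner solve could be warm-started or replaced by any other convex solver.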