Statistical and Topological Properties of Sliced Probability Divergences

The idea of slicing divergences has proven successful for comparing two probability measures in various machine learning applications, including generative modeling; it consists of computing the expected value of a `base divergence' between one-dimensional random projections of the two measures. However, the computational and statistical consequences of this technique have not yet been well established. In this paper, we aim to bridge this gap and derive several properties of sliced probability divergences. First, we show that slicing preserves the metric axioms and the weak continuity of the divergence, which implies that the sliced divergence shares similar topological properties with its base divergence. We then refine these results in the case where the base divergence belongs to the class of integral probability metrics. Moreover, we establish that, under mild conditions, the sample complexity of the sliced divergence does not depend on the dimension, even when the base divergence suffers from the curse of dimensionality. We finally apply our general results to the Wasserstein distance and Sinkhorn divergences, and illustrate our theory on both synthetic and real-data experiments.
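To make the slicing construction concrete: for a base divergence Δ between probability measures on R^d, the sliced divergence is typically defined as SΔ(μ, ν) = E_{θ∼σ}[Δ(θ*_♯μ, θ*_♯ν)], where σ is the uniform distribution on the unit sphere S^{d-1} and θ*_♯μ is the pushforward of μ by the projection x ↦ ⟨θ, x⟩. The Python sketch below is illustrative only (function name, defaults, and the example data are not from the paper): it estimates the sliced Wasserstein distance, a common instance of this construction, by Monte Carlo over random directions, using the fact that in one dimension the Wasserstein distance between equal-size empirical measures reduces to a sorted-sample comparison.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_projections=100, p=2, seed=None):
    """Monte Carlo estimate of the sliced Wasserstein distance SW_p between
    two empirical measures, given as samples X and Y of shape (n, d).
    Assumes equal sample sizes, so each one-dimensional Wasserstein distance
    reduces to comparing sorted (quantile-matched) projections."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Draw directions uniformly on the unit sphere S^{d-1}
    # by normalizing standard Gaussian vectors.
    theta = rng.standard_normal((n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both samples onto each direction: one 1D measure per column.
    X_proj = np.sort(X @ theta.T, axis=0)  # shape (n, n_projections)
    Y_proj = np.sort(Y @ theta.T, axis=0)
    # 1D W_p^p between equal-size empirical measures via sorted samples.
    wp_per_direction = np.mean(np.abs(X_proj - Y_proj) ** p, axis=0)
    # Average over directions (Monte Carlo expectation), then take the p-th root.
    return np.mean(wp_per_direction) ** (1.0 / p)

# Example: two Gaussian samples in dimension 50, one shifted by 1 in each coordinate.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 50))
Y = rng.standard_normal((500, 50)) + 1.0
print(sliced_wasserstein(X, Y, n_projections=200))
```

Note that the estimator only ever manipulates one-dimensional projections, which is why its sample complexity can avoid the exponential dependence on d that the unsliced Wasserstein distance exhibits.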
