On Projection Robust Optimal Transport: Sample Complexity and Model Misspecification

Optimal transport (OT) distances are increasingly used as loss functions for statistical inference, notably in the learning of generative models and in supervised learning. Yet the behavior of minimum Wasserstein estimators is poorly understood, particularly in high-dimensional regimes or under model misspecification. In this work we adopt the viewpoint of projection robust (PR) OT, which seeks to maximize the OT cost between two measures over $k$-dimensional subspaces onto which they are projected. Our first contribution is to establish several fundamental statistical properties of PR Wasserstein distances, complementing and improving on previous literature, which has been restricted to one-dimensional and well-specified cases. Next, we propose the integral PR Wasserstein (IPRW) distance as an alternative to the PRW distance, obtained by averaging rather than optimizing over subspaces. Our complexity bounds help explain why both PRW and IPRW distances empirically outperform Wasserstein distances in high-dimensional inference tasks. Finally, we consider parametric inference using the PRW distance. We provide asymptotic guarantees for two types of minimum PRW estimators and formulate a central limit theorem for the max-sliced Wasserstein estimator under model misspecification. To enable the analysis of PRW distances with projection dimension larger than one, we devise a novel combination of variational analysis and statistical theory.
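To make the average-versus-maximum distinction between IPRW and PRW concrete, the sketch below estimates both quantities between two equal-size empirical samples with uniform weights, where the projected 2-Wasserstein cost reduces to an assignment problem. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names (projected_w2, iprw_estimate, prw_lower_bound) are hypothetical, IPRW is approximated by a Monte Carlo average over random $k$-dimensional subspaces, and PRW is only lower-bounded by taking the best of the same random subspaces.

    # Illustrative sketch (assumptions noted above): Monte Carlo estimate of the
    # IPRW distance and a random-search lower bound on the PRW distance between
    # two equal-size empirical measures with uniform weights.
    import numpy as np
    from scipy.optimize import linear_sum_assignment


    def random_projection(d, k, rng):
        """Draw a random d x k matrix with orthonormal columns."""
        q, _ = np.linalg.qr(rng.standard_normal((d, k)))
        return q


    def projected_w2(x, y, u):
        """Squared 2-Wasserstein distance between the projections U^T x and U^T y."""
        xp, yp = x @ u, y @ u                            # project both samples onto the subspace
        cost = ((xp[:, None, :] - yp[None, :, :]) ** 2).sum(-1)
        row, col = linear_sum_assignment(cost)           # exact OT for uniform weights, equal sizes
        return cost[row, col].mean()


    def iprw_estimate(x, y, k, n_proj=100, seed=0):
        """IPRW-style estimate: average the projected cost over random subspaces."""
        rng = np.random.default_rng(seed)
        return np.mean([projected_w2(x, y, random_projection(x.shape[1], k, rng))
                        for _ in range(n_proj)])


    def prw_lower_bound(x, y, k, n_proj=100, seed=0):
        """Crude PRW lower bound: keep the best of the same random subspaces."""
        rng = np.random.default_rng(seed)
        return max(projected_w2(x, y, random_projection(x.shape[1], k, rng))
                   for _ in range(n_proj))


    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        x = rng.standard_normal((200, 30))
        y = rng.standard_normal((200, 30)) + np.r_[2.0, np.zeros(29)]  # shift along one direction
        print("IPRW estimate (k=2):", iprw_estimate(x, y, k=2))
        print("PRW lower bound (k=2):", prw_lower_bound(x, y, k=2))

In practice the PRW projection is optimized directly, e.g. with Riemannian methods over the Stiefel manifold, rather than by random search; the sketch is only meant to convey that IPRW averages the projected cost while PRW maximizes it.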
