Minibatch optimal transport distances; analysis and applications

Optimal transport distances have become a classic tool for comparing probability distributions and have found many applications in machine learning. Yet, despite recent algorithmic developments, their complexity prevents their direct use on large-scale datasets. To overcome this challenge, a common workaround is to compute these distances on minibatches, i.e., to average the outcomes of several smaller optimal transport problems. We propose in this paper an extended analysis of this practice, whose effects were previously studied only in restricted cases. We first consider a large variety of optimal transport kernels. We notably argue that the minibatch strategy comes with appealing properties, such as unbiased estimators and gradients and a concentration bound around the expectation, but also with limits: minibatch OT is not a distance. To recover some of the lost distance axioms, we introduce a debiased minibatch OT function and study its statistical and optimisation properties. Alongside this theoretical analysis, we conduct empirical experiments on gradient flows, generative adversarial networks (GANs), and color transfer that highlight the practical interest of this strategy.
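To make the estimator concrete, the sketch below averages exact OT costs over random minibatches using the POT library (`ot.dist` and `ot.emd2` are real POT calls); the function names, defaults, and the debiasing arrangement are our own illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def minibatch_ot(x, y, m=64, k=10, rng=None):
    """Minibatch OT sketch: draw k pairs of size-m minibatches from the
    samples x and y, solve each small OT problem exactly, and average
    the resulting costs."""
    rng = np.random.default_rng(rng)
    a = b = np.full(m, 1.0 / m)  # uniform weights on each minibatch
    costs = []
    for _ in range(k):
        xb = x[rng.choice(len(x), m, replace=False)]
        yb = y[rng.choice(len(y), m, replace=False)]
        M = ot.dist(xb, yb)             # pairwise squared Euclidean costs
        costs.append(ot.emd2(a, b, M))  # exact OT cost on this minibatch
    return float(np.mean(costs))

def debiased_minibatch_ot(x, y, m=64, k=10, seed=None):
    """Debiased variant: subtract the self-comparison terms, so the value
    is zero in expectation when x and y are drawn from the same law."""
    rng = np.random.default_rng(seed)
    return (minibatch_ot(x, y, m, k, rng)
            - 0.5 * minibatch_ot(x, x, m, k, rng)
            - 0.5 * minibatch_ot(y, y, m, k, rng))
```

Each inner problem costs only O(m^3 log m) rather than scaling with the full dataset, which is the point of the minibatch strategy; increasing k tightens the concentration of the average around its expectation, while the debiased variant recovers the property that the loss vanishes when the two distributions coincide.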
