Learning with minibatch Wasserstein: asymptotic and gradient properties

Optimal transport distances are powerful tools for comparing probability distributions and have found many applications in machine learning. Yet their algorithmic complexity prevents their direct use on large-scale datasets. To overcome this challenge, practitioners compute these distances on minibatches, i.e., they average the outcomes of several smaller optimal transport problems. We propose in this paper an analysis of this practice, whose effects are not well understood so far. We notably argue that it is equivalent to an implicit regularization of the original problem, with appealing properties such as unbiased estimators, gradients, and a concentration bound around the expectation, but also with defects such as the loss of the distance property. Along with this theoretical analysis, we also conduct empirical experiments on gradient flows, GANs, and color transfer that highlight the practical interest of this strategy.
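To make the minibatch strategy concrete, here is a minimal sketch of the averaging procedure described above, written with the POT library (https://pythonot.github.io). The function name `minibatch_wasserstein` and the parameters `m` (minibatch size) and `k` (number of minibatch draws) are illustrative assumptions for this sketch, not code from the paper.

```python
# Minibatch Wasserstein sketch: average the exact OT cost over k random
# minibatch pairs of size m, instead of solving one large OT problem.
import numpy as np
import ot  # Python Optimal Transport (POT)


def minibatch_wasserstein(x, y, m=64, k=10, rng=None):
    """Average the exact OT cost over k random minibatch pairs of size m."""
    rng = np.random.default_rng(rng)
    total = 0.0
    for _ in range(k):
        # Sample a minibatch (without replacement) from each empirical measure.
        xb = x[rng.choice(len(x), size=m, replace=False)]
        yb = y[rng.choice(len(y), size=m, replace=False)]
        M = ot.dist(xb, yb)        # pairwise squared Euclidean ground costs
        a = b = ot.unif(m)         # uniform weights on each minibatch
        total += ot.emd2(a, b, M)  # exact OT cost on the small m x m problem
    return total / k


# Usage: compare two large empirical distributions via small OT problems.
x = np.random.randn(10_000, 2)
y = np.random.randn(10_000, 2) + 1.0
print(minibatch_wasserstein(x, y, m=64, k=10, rng=0))
```

Each term in the average is an exact OT cost on only m points, so the estimator is cheap and, as the abstract notes, unbiased for the expected minibatch loss; the trade-off is the loss of the distance property, since the average is generally positive even when both samples are drawn from the same distribution.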
