Comparing two degenerate probability distributions (i.e. two probability distributions supported on two distinct low-dimensional manifolds living in a much higher-dimensional space) is a crucial problem arising in the estimation of generative models for high-dimensional observations, such as those arising in computer vision or natural language. Optimal transport (OT) metrics are known to offer a cure for this problem, since they were specifically designed as an alternative to information divergences to handle such problematic scenarios. Unfortunately, training generative machines using OT raises formidable computational and statistical challenges, because of (i) the computational burden of evaluating OT losses, (ii) the instability and lack of smoothness of these losses, and (iii) the difficulty of robustly estimating these losses and their gradients in high dimension. This paper presents the first tractable computational method to train large-scale generative models using an optimal transport loss, and tackles these three issues by relying on two key ideas: (a) entropic smoothing, which turns the original OT loss into one that can be computed using Sinkhorn fixed-point iterations; (b) algorithmic (automatic) differentiation of these iterations. These two approximations result in a robust and differentiable approximation of the OT loss with streamlined GPU execution. The resulting computational architecture nicely complements standard deep network generative models with a stack of extra layers implementing the loss function.
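To make idea (a) concrete, the following is a minimal sketch of Sinkhorn fixed-point iterations for an entropy-smoothed OT cost between two small point clouds. It is an illustration under stated assumptions, not the paper's implementation: the function name `sinkhorn_cost`, the parameters `epsilon` and `n_iters`, the uniform weights, the 1-D points, and the squared Euclidean ground cost are all hypothetical choices made here, and plain Python is used instead of a GPU framework.

```python
import math

def sinkhorn_cost(x, y, epsilon=0.1, n_iters=200):
    """Entropy-smoothed OT cost between two uniformly weighted 1-D point clouds.

    Hypothetical helper for illustration only. Returns (cost, plan).
    """
    n, m = len(x), len(y)
    a = [1.0 / n] * n  # uniform weights on x (assumption)
    b = [1.0 / m] * m  # uniform weights on y (assumption)

    # Ground cost: squared Euclidean distance (assumption).
    C = [[(xi - yj) ** 2 for yj in y] for xi in x]
    # Gibbs kernel. For very large C / epsilon this underflows to 0,
    # in which case a log-domain variant would be needed.
    K = [[math.exp(-c / epsilon) for c in row] for row in C]

    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]

    # Transport plan P = diag(u) K diag(v) and its transport cost <P, C>.
    P = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    cost = sum(P[i][j] * C[i][j] for i in range(n) for j in range(m))
    return cost, P
```

Each iteration uses only sums, products, and divisions, so in a framework with automatic differentiation the loop can be differentiated end to end with respect to the point positions, which is the role of idea (b). A fixed `n_iters` keeps the computation a finite stack of identical layers.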