A Survey on Optimal Transport for Machine Learning: Theory and Applications

Optimal Transport (OT) theory has received increasing attention from the computer science community due to its potency and relevance in modeling and machine learning. It provides powerful tools for comparing probability distributions and for computing optimal mappings that minimize cost functions. Therefore, it has been deployed in computer vision, improving image retrieval, image interpolation, and semantic correspondence algorithms, as well as in other fields such as domain adaptation, natural language processing, and variational inference. In this survey, we convey the emerging promise of optimal transport methods across various fields, as well as future directions of study for OT in machine learning. We begin by looking at the history of optimal transport and introducing the founders of the field. We then take a brief look at the algorithms related to OT. Next, we present a mathematical formulation and the prerequisites for understanding OT; these include Kantorovich duality, entropic regularization, KL divergence, and Wasserstein barycenters. Since OT is a computationally expensive problem, we then introduce the entropy-regularized version of computing optimal mappings, which has allowed OT to become applicable to a wide range of machine learning problems. In fact, the methods generated from OT theory are competitive with the current state-of-the-art methods. The last portion of this survey analyzes papers that focus on the application of OT within the context of machine learning. We first cover computer vision problems; these include GANs, semantic correspondence, and convolutional Wasserstein distances. We then break down research papers that focus on graph learning, neural architecture search, document representation, and domain adaptation. We close the paper with a small section on future research.
Of the recommendations presented, three main problems are fundamental to making OT widely applicable, but they depend strongly on its mathematical formulation and are therefore the hardest to answer. Since OT is a novel method, there is plenty of room for new research. With more and more competitive methods being created (in terms of either accuracy or computational speed), the future of applied optimal transport is bright, as it has become pervasive in machine learning.
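As a rough illustration of the entropy-regularized approach mentioned above, the following is a minimal sketch of Sinkhorn-style iterations on two small histograms. It is not taken from the survey: the function name `sinkhorn`, the parameters `reg` and `n_iters`, and the toy inputs are all assumptions chosen for illustration, and the plain-Python loops favor readability over speed.

```python
import math

def sinkhorn(a, b, cost, reg=0.1, n_iters=500):
    """Approximate an entropy-regularized transport plan between histograms.

    a, b:    lists of nonnegative weights, each summing to 1.
    cost:    cost[i][j] is the cost of moving one unit of mass from bin i to bin j.
    reg:     regularization strength (illustrative default, an assumption).
    n_iters: fixed number of scaling iterations (no convergence check).
    """
    n, m = len(a), len(b)
    # Gibbs kernel K = exp(-cost / reg)
    K = [[math.exp(-cost[i][j] / reg) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Rescale rows so that the plan's row sums match a
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        # Rescale columns so that the plan's column sums match b
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan P = diag(u) K diag(v)
    plan = [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]
    total_cost = sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
    return plan, total_cost

# Toy example (hypothetical data): move mass between two 2-bin histograms.
a = [0.5, 0.5]
b = [0.25, 0.75]
cost = [[0.0, 1.0], [1.0, 0.0]]
plan, total_cost = sinkhorn(a, b, cost)
```

Each iteration only rescales rows and columns of a fixed kernel, which is what makes this formulation cheaper than solving the unregularized linear program directly. Smaller values of `reg` bring the plan closer to the unregularized optimum but can cause numerical underflow in the kernel; a log-domain variant is the usual remedy.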
