Learning with a Wasserstein Loss

Learning to predict multi-label outputs is challenging, but in many problems there is a natural metric on the outputs that can be used to improve predictions. In this paper we develop a loss function for multi-label learning, based on the Wasserstein distance. The Wasserstein distance provides a natural notion of dissimilarity for probability measures. Although optimizing with respect to the exact Wasserstein distance is costly, recent work has described a regularized approximation that is efficiently computed. We describe an efficient learning algorithm based on this regularization, as well as a novel extension of the Wasserstein distance from probability measures to unnormalized measures. We also describe a statistical learning bound for the loss. The Wasserstein loss can encourage smoothness of the predictions with respect to a chosen metric on the output space. We demonstrate this property on a real-data tag prediction problem, using the Yahoo Flickr Creative Commons dataset, outperforming a baseline that doesn't use the metric.

[1]  M. Levandowsky,et al.  Distance between Sets , 1971, Nature.

[2]  C. Givens,et al.  A class of Wasserstein metrics for probability distributions. , 1984 .

[3]  M. Talagrand,et al.  Probability in Banach Spaces: Isoperimetry and Processes , 1991 .

[4]  L. Rudin,et al.  Nonlinear total variation based noise removal algorithms , 1992 .

[5]  John N. Tsitsiklis,et al.  Introduction to linear optimization , 1997, Athena scientific optimization and computation series.

[6]  Peter L. Bartlett,et al.  Rademacher and Gaussian Complexities: Risk Bounds and Structural Results , 2003, J. Mach. Learn. Res..

[7]  Leonidas J. Guibas,et al.  The Earth Mover's Distance as a Metric for Image Retrieval , 2000, International Journal of Computer Vision.

[8]  Trevor Darrell,et al.  Fast contour matching using approximate earth mover's distance , 2004, Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004..

[9]  F. Bassetti,et al.  On minimum Kantorovich distance estimators , 2006 .

[10]  Daniel Ruiz,et al.  A Fast Algorithm for Matrix Balancing , 2013, Web Information Retrieval and Linear Algebra Algorithms.

[11]  David W. Jacobs,et al.  Approximate earth mover’s distance in linear time , 2008, 2008 IEEE Conference on Computer Vision and Pattern Recognition.

[12]  C. Villani Optimal Transport: Old and New , 2008 .

[13]  Michael Werman,et al.  Fast and robust Earth Mover's Distances , 2009, 2009 IEEE 12th International Conference on Computer Vision.

[14]  M. Hidayath Ansari,et al.  Comparing Clusterings in Space , 2010, ICML.

[15]  Neil D. Lawrence,et al.  Kernels for Vector-Valued Functions: a Review , 2011, Found. Trends Mach. Learn..

[16]  V. Bogachev,et al.  The Monge-Kantorovich problem: achievements, connections, and perspectives , 2012 .

[17]  Marco Cuturi,et al.  Sinkhorn Distances: Lightspeed Computation of Optimal Transport , 2013, NIPS.

[18]  Jeffrey Dean,et al.  Distributed Representations of Words and Phrases and their Compositionality , 2013, NIPS.

[19]  Herbert Edelsbrunner,et al.  Persistent Homology: Theory and Practice , 2013 .

[20]  Leonidas J. Guibas,et al.  Wasserstein Propagation for Semi-Supervised Learning , 2014, ICML.

[21]  Arnaud Doucet,et al.  Fast Computation of Wasserstein Barycenters , 2013, ICML.

[22]  Sebastian Nowozin,et al.  Optimal Decisions from Probabilistic Models: The Intersection-over-Union Case , 2014, 2014 IEEE Conference on Computer Vision and Pattern Recognition.

[23]  Andrea Vedaldi,et al.  MatConvNet: Convolutional Neural Networks for MATLAB , 2014, ACM Multimedia.

[24]  G. Peyré,et al.  Unbalanced Optimal Transport: Geometry and Kantorovich Formulation , 2015 .

[25]  Iasonas Kokkinos,et al.  Semantic Image Segmentation with Deep Convolutional Nets and Fully Connected CRFs , 2014, ICLR.

[26]  Trevor Darrell,et al.  Fully convolutional networks for semantic segmentation , 2015, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[27]  David A. Shamma,et al.  The New Data and New Challenges in Multimedia Research , 2015, ArXiv.

[28]  Michael S. Bernstein,et al.  ImageNet Large Scale Visual Recognition Challenge , 2014, International Journal of Computer Vision.

[29]  Gabriel Peyré,et al.  A Smoothed Dual Approach for Variational Wasserstein Problems , 2015, SIAM J. Imaging Sci..