Supervised Tree-Wasserstein Distance

The Wasserstein distance is a powerful tool for measuring the similarity of documents, but it incurs a high computational cost. Recently, methods that approximate the Wasserstein distance with a tree metric have been proposed to speed up its computation. These tree-based methods allow fast comparison of large document collections; however, they are unsupervised and do not learn task-specific distances. In this work, we propose the Supervised Tree-Wasserstein (STW) distance, a fast, supervised metric-learning method based on the tree metric. Specifically, we rewrite the Wasserstein distance on a tree metric in terms of the parent–child relationships of the tree and formulate it as a continuous optimization problem with a contrastive loss. Experimentally, we show that the STW distance can be computed quickly and improves accuracy on document classification tasks. Furthermore, because the STW distance is expressed through matrix multiplications, it runs on a GPU and is well suited to batch processing, making it highly efficient when comparing a large number of documents.

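To illustrate the tree-metric formulation behind the abstract, the following is a minimal NumPy sketch (not the authors' implementation) of the closed-form tree-Wasserstein distance. It assumes the tree is encoded by a subtree-indicator matrix B and an edge-weight vector w, which is one common parameterization; the exact tree parameterization learned by STW may differ.

import numpy as np

# Minimal sketch under an assumed encoding (not the authors' code):
# B  -- (E x V) 0/1 matrix; B[e, v] = 1 iff leaf v lies in the subtree below edge e.
# w  -- (E,) vector of non-negative edge weights.
# mu, nu -- (V, N) matrices whose columns are probability distributions over leaves.
# The tree-Wasserstein distance for each column pair is a weighted L1 norm of the
# subtree mass differences, obtained with a single matrix multiplication.
def tree_wasserstein(B, w, mu, nu):
    diff = B @ (mu - nu)        # (E, N): subtree mass differences
    return w @ np.abs(diff)     # (N,): one distance per column pair

# Toy check on a star tree with three leaves and unit edge weights:
B = np.eye(3)                             # each edge separates exactly one leaf from the root
w = np.ones(3)
mu = np.array([[1.0], [0.0], [0.0]])
nu = np.array([[0.0], [1.0], [0.0]])
print(tree_wasserstein(B, w, mu, nu))     # [2.] -- the unit of mass travels two unit edges

Because the distance reduces to one matrix multiplication followed by an elementwise absolute value and a weighted sum, many document pairs can be processed in a single batch, and the same expression runs directly on a GPU, which is consistent with the batching claim in the abstract.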