Re-evaluating Word Mover's Distance

The word mover’s distance (WMD) is a fundamental technique for measuring the similarity of two documents. The crux of WMD is that it exploits the underlying geometry of the word space via an optimal transport formulation. The original study on WMD reported that WMD outperforms classical baselines such as bag-of-words (BOW) and TF-IDF by significant margins on various datasets. In this paper, we point out that the evaluation in the original study could be misleading. We re-evaluate the performance of WMD and the classical baselines and find that the classical baselines are competitive with WMD if we apply appropriate preprocessing, namely L1 normalization. In addition, we introduce an analogy between WMD and L1-normalized BOW and find that not only the performance but also the distance values of WMD resemble those of BOW in high-dimensional spaces.
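The comparison can be made concrete in a few lines of code. Below is a minimal sketch, assuming NumPy and the POT (Python Optimal Transport) library; the word embeddings are random placeholders standing in for pretrained word2vec vectors, and the toy documents are illustrative only. It computes WMD between two documents alongside the L1 distance between their L1-normalized BOW vectors, the classical baseline discussed above.

```python
# Minimal sketch: WMD vs. L1 distance on L1-normalized BOW (nBOW).
# Assumes NumPy and POT (pip install pot); embeddings here are random
# placeholders for pretrained word2vec vectors.
import numpy as np
import ot  # Python Optimal Transport

docs = ["obama speaks to the media in illinois",
        "the president greets the press in chicago"]

# Build a shared vocabulary and raw BOW counts.
vocab = sorted(set(" ".join(docs).split()))
index = {w: i for i, w in enumerate(vocab)}
bow = np.zeros((2, len(vocab)))
for d, doc in enumerate(docs):
    for w in doc.split():
        bow[d, index[w]] += 1

# L1 normalization turns counts into probability distributions (nBOW).
nbow = bow / bow.sum(axis=1, keepdims=True)

# Placeholder embeddings; in practice, load pretrained vectors here.
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(vocab), 300))

# WMD = optimal transport cost between the two nBOW distributions
# under the Euclidean ground metric on the word embeddings.
M = ot.dist(emb, emb, metric="euclidean")
wmd = ot.emd2(nbow[0], nbow[1], M)

# Classical baseline: L1 distance between L1-normalized BOW vectors.
l1 = np.abs(nbow[0] - nbow[1]).sum()

print(f"WMD: {wmd:.4f}  L1(nBOW): {l1:.4f}")
```

With real pretrained embeddings in place of the random matrix, this is the setting in which the two distance values can be compared directly, since both operate on the same L1-normalized BOW representation.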
