A Data Dependent Algorithm for Querying Earth Mover's Distance with Low Doubling Dimension

In this paper, we consider the following query problem: given two weighted point sets $A$ and $B$ in the Euclidean space $\mathbb{R}^d$, we want to quickly determine whether their earth mover's distance (EMD) is larger or smaller than a pre-specified threshold $T\geq 0$. The problem has a number of important applications in machine learning and data mining. In particular, we assume that the dimensionality $d$ is not fixed and that the sizes $|A|$ and $|B|$ are large. Under these conditions, most existing EMD algorithms are inefficient for this problem because of their high complexity. Here, we consider the problem under the assumption that $A$ and $B$ have low doubling dimensions, which is common for high-dimensional real-world data. Inspired by the geometric method {\em net tree}, we propose a novel ``data-dependent'' algorithm that avoids directly computing the EMD between $A$ and $B$, and thus solves the query problem more efficiently. We also study the performance of our method on synthetic and real datasets. The experimental results suggest that our method can substantially reduce running time compared with existing EMD algorithms.
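
To make the query problem concrete, the following minimal sketch shows how a threshold query can sometimes be answered without computing the EMD exactly. It is not the net-tree algorithm proposed in the paper. It only combines two standard bounds: the distance between weighted centroids (a lower bound) and the cost of a greedy feasible flow (an upper bound). The sketch assumes that both point sets have the same total weight and that the EMD is the unnormalized minimum transport cost; the function names are hypothetical.

```python
import math

def centroid_lower_bound(A, wa, B, wb):
    """Lower bound on the EMD: total weight times the distance between
    the weighted centroids (assumes sum(wa) == sum(wb))."""
    W = sum(wa)
    d = len(A[0])
    ca = [sum(w * p[k] for p, w in zip(A, wa)) / W for k in range(d)]
    cb = [sum(w * q[k] for q, w in zip(B, wb)) / W for k in range(d)]
    return W * math.dist(ca, cb)

def greedy_upper_bound(A, wa, B, wb):
    """Upper bound on the EMD: cost of one feasible flow, built by
    pairing points in input order (northwest-corner rule)."""
    ra, rb = list(wa), list(wb)  # remaining weight at each point
    i = j = 0
    cost = 0.0
    while i < len(A) and j < len(B):
        f = min(ra[i], rb[j])            # ship as much as possible
        cost += f * math.dist(A[i], B[j])
        ra[i] -= f
        rb[j] -= f
        if ra[i] <= 1e-12:               # point A[i] is exhausted
            i += 1
        if rb[j] <= 1e-12:               # point B[j] is exhausted
            j += 1
    return cost

def emd_threshold_query(A, wa, B, wb, T):
    """Answer 'larger' or 'not larger' when the bounds decide the query;
    otherwise report 'undecided' (an exact solver would be needed)."""
    if centroid_lower_bound(A, wa, B, wb) > T:
        return "larger"
    if greedy_upper_bound(A, wa, B, wb) <= T:
        return "not larger"
    return "undecided"

# Two well-separated sets: both bounds equal 10 here.
A = [(0.0, 0.0), (1.0, 0.0)]
B = [(10.0, 0.0), (11.0, 0.0)]
w = [0.5, 0.5]
print(emd_threshold_query(A, w, B, w, T=5.0))
print(emd_threshold_query(A, w, B, w, T=12.0))
```

When the threshold falls between the two bounds, this filter cannot decide, and that is the regime where a more refined, data-dependent method such as the one proposed here is needed.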
