Heads-Join: Efficient Earth Mover's Distance Similarity Joins on Hadoop

The Earth Mover's Distance (EMD) similarity join has a number of important applications, such as near-duplicate image retrieval and distribution-based pattern analysis. However, the computational cost of EMD is supercubic, and consequently the EMD similarity join operation is prohibitively expensive even for medium-sized datasets. We propose to employ the Hadoop platform to speed up the operation. Simply porting the state-of-the-art metric distance similarity join algorithms to Hadoop is inefficient because they involve excessive distance computations and are vulnerable to skewed data distributions. We propose a novel framework, named HEADS-JOIN, which transforms data into the space of EMD lower bounds and performs pruning and partitioning at a low cost, because computing these EMD lower bounds has constant or linear complexity. We investigate both range and top-k joins, and design efficient algorithms on three popular Hadoop computation paradigms, i.e., MapReduce, Bulk Synchronous Parallel, and Spark. We conduct extensive experiments on both real and synthetic datasets. The results show that HEADS-JOIN outperforms the state-of-the-art metric similarity join technique, i.e., Quickjoin, by up to an order of magnitude and scales out well.
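The pruning idea can be illustrated with a small single-machine sketch. It is not the HEADS-JOIN algorithm itself: it assumes 1-D histograms of equal length and equal total mass with ground distance |i - j|, and it uses the well-known centroid lower bound (the distance between centroids never exceeds the EMD) as a stand-in for the lower bounds used in the paper. The function names `emd_1d`, `centroid`, and `range_join` are hypothetical.

```python
from itertools import product


def emd_1d(p, q):
    """Exact EMD between two 1-D histograms of equal length and equal total
    mass, with ground distance |i - j| (assumption made for this sketch)."""
    total = sum(p)
    carried, work = 0.0, 0.0
    for a, b in zip(p, q):
        carried += a - b       # mass that must still be moved past this bin
        work += abs(carried)   # moving it one bin further costs |carried|
    return work / total


def centroid(p):
    """Mass-weighted mean bin index; linear in the histogram size."""
    return sum(i * w for i, w in enumerate(p)) / sum(p)


def range_join(R, S, eps):
    """Filter-and-refine range join: return index pairs with EMD <= eps,
    plus the number of exact EMD computations that were needed."""
    # Centroids are computed once per record, so each pairwise bound
    # below costs constant time.
    cent_R = [centroid(p) for p in R]
    cent_S = [centroid(q) for q in S]
    results, refined = [], 0
    for i, j in product(range(len(R)), range(len(S))):
        # Lower bound: |centroid difference| <= EMD, so a bound above eps
        # proves the pair cannot qualify and the exact EMD is skipped.
        if abs(cent_R[i] - cent_S[j]) > eps:
            continue
        refined += 1
        if emd_1d(R[i], S[j]) <= eps:
            results.append((i, j))
    return results, refined


if __name__ == "__main__":
    R = [[4, 0, 0, 0], [0, 0, 0, 4]]
    S = [[3, 1, 0, 0], [0, 0, 1, 3]]
    pairs, refined = range_join(R, S, eps=0.5)
    print(pairs, refined)
```

In this toy input, two of the four candidate pairs are discarded by the lower bound alone, so only two exact EMD computations are performed. In the general multi-dimensional case the exact computation is the expensive step, which is why a cheap filter pays off.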
