Set Similarity Joins on MapReduce: An Experimental Survey

Set similarity joins, which compute pairs of similar sets, are an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths, and weaknesses of these techniques. Previous comparisons are limited to a small subset of the relevant algorithms, and the large differences between the test setups make it hard to draw overall conclusions. In this paper, we survey ten recent distributed set similarity join algorithms, all based on the MapReduce paradigm. We empirically compare the algorithms in a uniform test environment on twelve datasets that exhibit different characteristics and represent a broad range of applications. Our experiments yield a surprising result: all algorithms in our test fail to scale for at least one dataset and are sensitive to long sets, frequent set elements, low similarity thresholds, or a combination thereof. Interestingly, some algorithms even fail to handle small datasets that can easily be processed in a non-distributed setting. Our analytic investigation of the algorithms pinpoints the reasons for the poor performance, and targeted experiments confirm our analytic findings. Based on our investigation, we suggest directions for future research in the area.
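To make the operator concrete, the following is a minimal, non-distributed sketch of a set similarity self-join. It is not one of the surveyed algorithms. It assumes Jaccard similarity as the similarity function (the abstract does not name one), and the names `jaccard`, `naive_set_similarity_join`, and `collection` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def jaccard(r: frozenset, s: frozenset) -> float:
    """Jaccard similarity |r & s| / |r | s|; defined here as 0.0 for two empty sets."""
    union = len(r | s)
    return len(r & s) / union if union else 0.0


def naive_set_similarity_join(sets, threshold):
    """Return all index pairs (i, j), i < j, whose Jaccard similarity is >= threshold.

    Every pair is compared, so the cost grows quadratically with the number
    of sets. Assumption: Jaccard is used only as an example similarity function.
    """
    return [
        (i, j)
        for (i, r), (j, s) in combinations(enumerate(sets), 2)
        if jaccard(r, s) >= threshold
    ]


# Hypothetical toy collection: the first two sets share 3 of 5 distinct elements.
collection = [
    frozenset({"a", "b", "c", "d"}),
    frozenset({"a", "b", "c", "e"}),
    frozenset({"x", "y"}),
]
print(naive_set_similarity_join(collection, 0.5))  # prints [(0, 1)]
```

Lowering the threshold admits more pairs into the result. In this naive sketch the number of comparisons does not depend on the threshold, since all pairs are always verified.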
