An Experimental Survey of MapReduce-Based Similarity Joins

In recent years, Big Data systems and their main data processing framework, MapReduce, have been introduced to process and analyze massive amounts of data efficiently. One of the key data processing and analysis operations is the Similarity Join (SJ), which finds similar pairs of objects between two datasets. The study of SJ techniques for Big Data systems has emerged as a key topic in the database community, and several research teams have published techniques to solve the SJ problem on these systems. However, many of these techniques were never experimentally compared against alternative approaches, partly because some were developed in parallel and others were not implemented, even as part of their original publications. Consequently, there is no clear understanding of how these techniques compare to each other or of which technique to use in a specific scenario. This paper addresses this problem by studying, classifying, and comparing previously proposed MapReduce-based SJ algorithms. Its contributions include a classification of SJ algorithms based on the data types and distance functions they support, and an extensive set of experimental results. Furthermore, the authors have made available their open-source implementation of many SJ algorithms to enable other researchers and practitioners to apply and extend them.
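To make the SJ operation concrete, the following is a minimal sketch of a distance-threshold similarity join organized as a map phase, a shuffle, and a reduce phase. It is not one of the algorithms surveyed in the paper. It assumes points given as tuples of numbers, Euclidean distance, and a threshold `eps`, and it simulates the MapReduce stages in memory rather than on a cluster. All function names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def naive_similarity_join(r, s, eps):
    """Reference answer: compare every pair (quadratic cost)."""
    return {(a, b) for a in r for b in s if dist(a, b) <= eps}


def cell_of(point, eps):
    """Grid cell (side length eps) that contains the point."""
    return tuple(floor(c / eps) for c in point)


def map_phase(r, s, eps):
    """Emit (cell, (tag, point)) pairs.

    Each R point goes only to its own cell. Each S point is replicated
    to its own cell and all adjacent cells, so any pair within eps is
    guaranteed to meet in the R point's cell.
    """
    for p in r:
        yield cell_of(p, eps), ("R", p)
    for q in s:
        home = cell_of(q, eps)
        for offset in product((-1, 0, 1), repeat=len(q)):
            yield tuple(h + o for h, o in zip(home, offset)), ("S", q)


def shuffle(pairs):
    """Group emitted values by key, as the framework would do."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, eps):
    """Within each cell, compare R points against S points only."""
    result = set()
    for values in groups.values():
        r_points = [p for tag, p in values if tag == "R"]
        s_points = [p for tag, p in values if tag == "S"]
        for a in r_points:
            for b in s_points:
                if dist(a, b) <= eps:
                    result.add((a, b))
    return result


def grid_similarity_join(r, s, eps):
    return reduce_phase(shuffle(map_phase(r, s, eps)), eps)


if __name__ == "__main__":
    R = [(0.0, 0.0), (5.0, 5.0), (2.0, 2.0)]
    S = [(0.5, 0.5), (5.0, 6.0), (9.0, 9.0), (2.9, 2.0)]
    for pair in sorted(grid_similarity_join(R, S, eps=1.0)):
        print(pair)
```

Replicating only one of the two datasets keeps each qualifying pair in exactly one cell, at the cost of shipping each S point to every neighboring cell. This replication grows quickly with dimensionality, which is one reason a simple grid like this is only an illustration.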
