Technical Report: MapReduce-based Similarity Joins

Cloud-enabled systems have become a crucial component for efficiently processing and analyzing massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little work has addressed Similarity Joins for cloud systems. This paper focuses on the study, design, and implementation techniques of cloud-based Similarity Joins. We present MRSimJoin, a MapReduce-based algorithm that efficiently solves the Similarity Join problem. The algorithm repeatedly partitions and distributes the data until the subsets are small enough to be processed on a single node. It is general enough to be used with data that lies in any metric space, and therefore with multiple data types, e.g., numerical data, vector data, and text. We present multiple guidelines for implementing the algorithm in Hadoop, a widely used open-source cloud system. An extensive experimental evaluation of the implemented operation shows that it has very good execution time and scalability properties.
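
To make the partition-until-small idea concrete, the following is a minimal single-machine sketch in Python. It is not the MRSimJoin algorithm itself and does not use Hadoop. It illustrates one simple way to partition a metric-space data set around randomly chosen pivots and replicate records that lie near a partition boundary, so that every qualifying pair still meets in at least one partition. The function names, the parameters (`num_pivots`, `max_size`, `depth`), the replication rule, and the inclusive treatment of the threshold are all assumptions made for illustration.

```python
import random
from itertools import combinations


def brute_force_join(records, dist, eps):
    """Return all id pairs (i, j), i < j, whose values are within eps.

    `records` is a list of (id, value) tuples; `dist` is any metric.
    """
    return {
        (min(i, j), max(i, j))
        for (i, a), (j, b) in combinations(records, 2)
        if dist(a, b) <= eps
    }


def partitioned_join(records, dist, eps, num_pivots=2, max_size=4,
                     depth=8, rng=None):
    """Similarity join by recursive pivot partitioning (illustrative only).

    Each record goes to the partition of its nearest pivot and is also
    replicated to every partition whose pivot is at most 2 * eps farther
    away than the nearest one. By the triangle inequality, any two records
    within eps of each other then share at least one partition.
    """
    rng = rng or random.Random(0)
    # Small enough for a "single node" (or recursion budget spent).
    if len(records) <= max_size or depth == 0:
        return brute_force_join(records, dist, eps)

    k = min(num_pivots, len(records))
    pivots = [value for _, value in rng.sample(records, k)]
    partitions = [[] for _ in pivots]
    for rec in records:
        d = [dist(rec[1], p) for p in pivots]
        nearest = min(d)
        for idx, d_idx in enumerate(d):
            if d_idx - nearest <= 2 * eps:
                partitions[idx].append(rec)

    result = set()
    for part in partitions:
        if len(part) == len(records):
            # Partitioning made no progress on this subset; stop recursing.
            result |= brute_force_join(part, dist, eps)
        else:
            result |= partitioned_join(part, dist, eps, num_pivots,
                                       max_size, depth - 1, rng)
    return result


if __name__ == "__main__":
    data = list(enumerate([1, 2, 10, 11, 12, 30, 31, 50, 52, 53, 90]))
    print(sorted(partitioned_join(data, lambda a, b: abs(a - b), eps=1)))
```

Because the sketch only calls `dist`, the same code works for other data types by swapping in a different metric, for example Euclidean distance over tuples or edit distance over strings. A pair that straddles a boundary can be found in more than one partition; collecting results in a set removes those duplicates here, whereas a distributed implementation would need its own strategy for avoiding or eliminating them.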
