Fast and Scalable Distributed Set Similarity Joins for Big Data Analytics

Set similarity join is an essential operation in big data analytics, e.g., data integration and data cleaning, that finds similar pairs from two collections of sets. To cope with the increasing scale of data, distributed algorithms are needed to support large-scale set similarity joins. Multiple techniques have been proposed in recent years to perform similarity joins using MapReduce. Because MapReduce is a shared-nothing framework, however, these techniques usually produce huge numbers of duplicates in order to process the data in parallel. The large number of duplicates incurs both a large shuffle cost and unnecessary computation cost, which significantly decreases performance. Moreover, these approaches provide no load-balancing guarantee, which leads to skew and limits their scalability. To address these problems, in this paper we propose a duplicate-free framework, called FS-Join, that performs set similarity joins efficiently by using an innovative vertical partitioning technique. FS-Join employs three powerful filtering methods to prune dissimilar string pairs without computing their similarity scores. To further improve performance and scalability, FS-Join integrates horizontal partitioning. Experimental results on three real datasets show that FS-Join outperforms the state-of-the-art methods by one order of magnitude on average, demonstrating the good scalability and performance of the proposed technique.
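
The idea of a duplicate-free vertical partitioning can be illustrated with a small single-machine sketch. It is not the FS-Join implementation: the function name `vertical_partition_join`, the `pivots` parameter, and the choice of Jaccard similarity are assumptions made for illustration, and the three filtering methods and the horizontal partitioning mentioned above are omitted. Each token falls into exactly one segment of the token universe, so every set is cut into disjoint fragments and no token is replicated across partitions. Each partition counts the overlap between its fragments, and the partial counts are summed before the similarity is verified.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import combinations


def vertical_partition_join(sets, pivots, threshold):
    """Self-join sketch: return {(id_a, id_b): jaccard} for similar pairs.

    sets      -- dict mapping a record id to a set of comparable tokens
    pivots    -- sorted token boundaries (hypothetical parameter) that split
                 the token universe into len(pivots) + 1 disjoint segments
    threshold -- minimum Jaccard similarity for a pair to be reported
    """
    # "Map" step: cut every set into fragments, one per token segment.
    # A token lands in exactly one segment, so nothing is duplicated.
    partitions = defaultdict(dict)
    for rid, tokens in sets.items():
        for tok in tokens:
            seg = bisect_right(pivots, tok)
            partitions[seg].setdefault(rid, set()).add(tok)

    # "Reduce" step: each partition counts shared tokens between the
    # fragments it holds; partial counts are summed per record pair.
    overlap = defaultdict(int)
    for fragments in partitions.values():
        for a, b in combinations(sorted(fragments), 2):
            common = len(fragments[a] & fragments[b])
            if common:
                overlap[(a, b)] += common

    # Verification step: turn the summed overlap into a Jaccard score.
    result = {}
    for (a, b), inter in overlap.items():
        union = len(sets[a]) + len(sets[b]) - inter
        score = inter / union
        if score >= threshold:
            result[(a, b)] = score
    return result


if __name__ == "__main__":
    records = {
        1: {"a", "b", "c", "d"},
        2: {"b", "c", "d", "e"},
        3: {"x", "y", "z"},
    }
    print(vertical_partition_join(records, pivots=["c"], threshold=0.5))
```

In this example, records 1 and 2 share one token in the first segment and two in the second, so their summed overlap is 3 and their Jaccard similarity is 3/5. In a real distributed setting the per-partition loop would run on separate workers, and the quality of the result would depend on how the pivots balance fragment sizes.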
