A Density-Aware Similarity Join Query Processing Algorithm on MapReduce

The volume of data is growing rapidly, and MapReduce has attracted much interest as a paradigm for data-intensive applications. Similarity join is an essential operation in data analytics, with uses that include record linkage, near-duplicate detection, and document clustering. However, the performance of MapReduce is limited on complex analytical tasks that involve joins of multiple datasets. Workload-aware data partitioning techniques are therefore required to balance the computation across machines. In this paper, we propose a similarity join algorithm on MapReduce that achieves scalability and high performance by using a grid-based data mapping technique to join datasets. Our experimental analysis shows that the algorithm outperforms the existing algorithm across various data sizes and similarity thresholds.
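To make the idea of grid-based data mapping concrete, the following is a minimal single-process sketch of a generic grid-partitioned similarity join. It is not the algorithm proposed in the paper, whose details are not given in this abstract. The sketch assumes points in Euclidean space, a distance threshold `eps`, and uniform cells of side `eps`. All function names (`cell_of`, `map_phase`, `reduce_phase`, `grid_similarity_join`) are hypothetical. A density-aware variant would presumably vary the cell size or the cell-to-reducer assignment according to local data density, which this sketch does not attempt.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def cell_of(point, eps):
    """Return the grid cell (tuple of integer indices) containing the point."""
    return tuple(floor(c / eps) for c in point)


def map_phase(r_points, s_points, eps):
    """Simulate the map step: assign points to grid cells (the join keys).

    Each R point goes only to its home cell. Each S point is replicated to
    its home cell and all adjacent cells, so any S point within eps of an
    R point is guaranteed to share a cell with it.
    """
    buckets = defaultdict(lambda: ([], []))  # cell -> (R points, S points)
    for p in r_points:
        buckets[cell_of(p, eps)][0].append(p)
    for q in s_points:
        home = cell_of(q, eps)
        for offset in product((-1, 0, 1), repeat=len(q)):
            neighbor = tuple(h + o for h, o in zip(home, offset))
            buckets[neighbor][1].append(q)
    return buckets


def reduce_phase(buckets, eps):
    """Simulate the reduce step: verify candidate pairs within each cell."""
    result = []
    for r_list, s_list in buckets.values():
        for p in r_list:
            for q in s_list:
                if dist(p, q) <= eps:
                    result.append((p, q))
    return result


def grid_similarity_join(r_points, s_points, eps):
    """Join R and S on Euclidean distance <= eps using grid partitioning."""
    return reduce_phase(map_phase(r_points, s_points, eps), eps)
```

Because each R point lives in exactly one cell, every qualifying pair is produced once and no deduplication step is needed. The cost of this simple scheme is the replication of S points, and with skewed data some cells receive far more points than others, which is the load imbalance that workload-aware partitioning aims to reduce.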
