Large-Scale Similarity Join with Edit-Distance Constraints

In the age of big data, the data quality problem is more severe than ever. As an essential step in data cleaning, similarity join has attracted considerable attention from the database community. In this work, to address the similarity join problem with edit-distance constraints, we first improve the partition-based join algorithm for small-scale data. We then extend the algorithm to the MapReduce framework for large-scale data. Extensive experiments on both real and simulated datasets demonstrate the efficiency of our algorithms.
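
To make the idea of a partition-based join concrete, the following is a minimal sketch of the general pigeonhole technique, not the improved algorithm of this work. It assumes the standard observation that if two strings are within edit distance tau, then splitting one of them into tau + 1 segments leaves at least one segment that appears unchanged as a substring of the other. All function names (`edit_distance`, `segments`, `substrings`, `similarity_join`) are hypothetical, and the substring enumeration is deliberately naive.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def segments(s, tau):
    """Split s into tau + 1 contiguous segments of near-equal length."""
    k = tau + 1
    base, extra = divmod(len(s), k)
    out, pos = [], 0
    for i in range(k):
        length = base + (1 if i >= k - extra else 0)
        out.append(s[pos:pos + length])
        pos += length
    return out


def substrings(s):
    """All substrings of s, including the empty string (naive enumeration)."""
    subs = {""}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            subs.add(s[i:j])
    return subs


def similarity_join(strings, tau):
    """Return sorted index pairs (i, j), i < j, with edit distance <= tau."""
    # Filter step: index every segment of every string.
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for seg in segments(s, tau):
            index[seg].add(sid)

    results = set()
    for rid, r in enumerate(strings):
        # Candidates share at least one indexed segment with a substring of r.
        candidates = set()
        for sub in substrings(r):
            candidates |= index.get(sub, set())
        # Verification step: length filter, then exact edit distance.
        for sid in candidates:
            if sid < rid and abs(len(strings[sid]) - len(r)) <= tau:
                if edit_distance(strings[sid], r) <= tau:
                    results.add((sid, rid))
    return sorted(results)


pairs = similarity_join(["vldb", "pvldb", "sigmod", "sigmmod", "icde"], tau=1)
```

The filter step only prunes candidates; every surviving pair is still verified with the exact edit distance, so the sketch returns the same pairs as a brute-force comparison. In a MapReduce-style setting, one plausible arrangement (an assumption, not described in the abstract) would use segments as shuffle keys so that matching happens within reducers.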