ED-JOIN: AN EFFICIENT ALGORITHM FOR SIMILARITY JOINS WITH EDIT DISTANCE CONSTRAINTS

Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. In this project, we implement an efficient algorithm for similarity joins with edit distance constraints. Current approaches mainly convert the edit distance constraint into a weaker constraint on the number of matching q-grams between a pair of strings. In our project, we take a different perspective and investigate the mismatching q-grams instead. We derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits these mismatch-based filtering methods; it achieves a substantial reduction in candidate set sizes and hence saves computation time.
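To make the matching q-gram constraint used by current approaches concrete, here is a minimal Python sketch of a count-filtered similarity join. It is not Ed-Join and does not implement the mismatch-based bounds. It only illustrates the baseline idea that two strings within edit distance tau must share at least max(|s|, |t|) - q + 1 - q*tau q-grams, so pairs below that count can be pruned before the exact edit distance is computed. All function names (`qgrams`, `edit_distance`, `passes_count_filter`, `naive_ed_join`) are hypothetical, and the quadratic pair loop is an assumption made for brevity rather than an indexing strategy taken from the source.

```python
from collections import Counter


def qgrams(s, q):
    """Return the multiset of q-grams (length-q substrings) of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def edit_distance(s, t):
    """Textbook dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]


def passes_count_filter(s, t, q, tau):
    """Necessary condition for edit_distance(s, t) <= tau."""
    # Length filter: the lengths can differ by at most tau.
    if abs(len(s) - len(t)) > tau:
        return False
    # Count filter: one edit operation destroys at most q q-grams.
    needed = max(len(s), len(t)) - q + 1 - q * tau
    if needed <= 0:
        return True  # the bound is vacuous, so the pair cannot be pruned
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= needed


def naive_ed_join(strings, q, tau):
    """Return all pairs within edit distance tau (illustrative, quadratic)."""
    result = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if passes_count_filter(s, t, q, tau) and edit_distance(s, t) <= tau:
                result.append((s, t))
    return result
```

For example, `naive_ed_join(["similarity", "similarty", "database"], q=2, tau=1)` keeps only the pair `("similarity", "similarty")`; both pairs involving `"database"` are pruned by the filters, so the exact edit distance is computed only once.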
