ED-JOIN: AN EFFICIENT ALGORITHM FOR SIMILARITY JOINS WITH EDIT DISTANCE CONSTRAINTS

Similarity join is a fundamental operation in many application areas, such as data integration and cleaning, bioinformatics, and pattern recognition. In this project, we implement an efficient algorithm for similarity joins with edit distance constraints. Current approaches mainly convert the edit distance constraint into a weaker constraint on the number of matching q-grams between a pair of strings. In our project, we take the novel perspective of investigating mismatching q-grams. We derive two new edit distance lower bounds by analyzing the locations and contents of mismatching q-grams. A new algorithm, Ed-Join, is proposed that exploits the new mismatch-based filtering methods; it achieves a substantial reduction in candidate sizes and hence saves computation time.
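To make the baseline concrete, the following is a minimal Python sketch of the conventional matching-q-gram approach that the abstract contrasts Ed-Join with. It is not the Ed-Join algorithm and does not implement its mismatch-based filters. It assumes the standard count-filtering bound, under which two strings within edit distance tau must share at least max(|s|, |t|) - q + 1 - q * tau q-grams, because a single edit operation can destroy at most q q-grams. The function names (`qgrams`, `passes_count_filter`, `edit_distance`, `similarity_join`) are illustrative and not taken from the source, and the quadratic pair enumeration stands in for the inverted-index candidate generation a real implementation would use.

```python
from collections import Counter


def qgrams(s: str, q: int) -> Counter:
    """Return the bag (multiset) of length-q substrings of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def passes_count_filter(s: str, t: str, q: int, tau: int) -> bool:
    """Necessary (not sufficient) condition for edit_distance(s, t) <= tau."""
    # Length filter: each edit changes the length by at most one.
    if abs(len(s) - len(t)) > tau:
        return False
    # Count filter: one edit destroys at most q q-grams.
    required = max(len(s), len(t)) - q + 1 - q * tau
    shared = sum((qgrams(s, q) & qgrams(t, q)).values())
    return shared >= required


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, kept to two rows."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or match
        prev = curr
    return prev[-1]


def similarity_join(strings, q: int, tau: int):
    """Self-join: filter candidate pairs, then verify the survivors exactly."""
    results = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            s, t = strings[i], strings[j]
            if passes_count_filter(s, t, q, tau) and edit_distance(s, t) <= tau:
                results.append((s, t))
    return results
```

For example, `similarity_join(["similarity", "similarly", "dissimilar"], q=2, tau=2)` keeps only the pair `("similarity", "similarly")`. The pair `("similarity", "dissimilar")` shares six 2-grams and so passes the count filter, yet it fails verification; candidates of this kind are what stronger filters aim to prune before the expensive distance computation.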
