A partition-based method for string similarity joins with edit-distance constraints

As an essential operation in data cleaning, the similarity join has attracted considerable attention from the database community. In this article, we study string similarity joins with edit-distance constraints, which, given two large sets of strings, find all string pairs whose edit distance is within a given threshold. Existing algorithms are efficient either for short strings or for long strings, but no algorithm efficiently and adaptively supports both. To address this problem, we propose a new filter, called the segment filter. We partition a string into a set of segments and use the segments as a filter to find similar string pairs. We first create inverted indices for the segments. Then, for each string, we select some of its substrings, look up the selected substrings in the inverted indices, and take the strings on the inverted lists of the matched substrings as candidates for this string. Finally, we verify the candidates to generate the final answer. We devise efficient techniques to select substrings and prove that our method minimizes the number of selected substrings. We develop novel pruning techniques to verify the candidates efficiently. We also extend our techniques to support normalized edit distance. Experimental results show that our algorithms are efficient for both short strings and long strings, and outperform state-of-the-art methods on real-world datasets.
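The partition, index, probe, and verify pipeline described above can be illustrated with a minimal Python sketch. It relies on the pigeonhole observation that if a string is split into τ + 1 segments and at most τ edits are applied, at least one segment survives unchanged, shifted by at most τ positions. Several parts are assumptions made for illustration and are not taken from the article: the even partition scheme, the function names `segment_layout` and `segment_join`, and the naive choice of probing every substring within a window of τ positions. In particular, the sketch does not implement the minimal substring selection or the verification pruning that the abstract mentions; it verifies candidates with a plain dynamic-programming edit distance.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Plain dynamic-programming edit distance (no pruning)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def segment_layout(length, tau):
    """(start, segment_length) for tau + 1 near-equal segments.

    Assumption: an even partition in which the last segments absorb
    the remainder. The layout depends only on the string length.
    """
    n = tau + 1
    base, extra = divmod(length, n)
    layout, pos = [], 0
    for i in range(n):
        seg_len = base + (1 if i >= n - extra else 0)
        layout.append((pos, seg_len))
        pos += seg_len
    return layout


def segment_join(R, S, tau):
    """Return sorted (i, j) with edit_distance(R[i], S[j]) <= tau."""
    # Inverted index over the segments of S, keyed by
    # (string length, segment number, segment text).
    index = defaultdict(list)
    for sid, s in enumerate(S):
        for i, (pos, seg_len) in enumerate(segment_layout(len(s), tau)):
            index[(len(s), i, s[pos:pos + seg_len])].append(sid)

    results = set()
    for rid, r in enumerate(R):
        candidates = set()
        # Length filter: only strings whose length differs by <= tau.
        for length in range(max(0, len(r) - tau), len(r) + tau + 1):
            for i, (pos, seg_len) in enumerate(segment_layout(length, tau)):
                # Naive substring selection: a surviving segment can
                # shift by at most tau positions.
                lo = max(0, pos - tau)
                hi = min(len(r) - seg_len, pos + tau)
                for p in range(lo, hi + 1):
                    key = (length, i, r[p:p + seg_len])
                    candidates.update(index.get(key, ()))
        # Verification step.
        for sid in candidates:
            if edit_distance(r, S[sid]) <= tau:
                results.add((rid, sid))
    return sorted(results)
```

For example, `segment_join(["vldb"], ["pvldb"], 1)` splits `"pvldb"` into the segments `"pv"` and `"ldb"`, finds `"ldb"` as a substring of `"vldb"`, and reports the pair after verification. Strings shorter than τ + 1 produce empty segments that match everywhere; the result stays correct, but the filter prunes nothing for them.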
