Efficient exact edit similarity query processing with the asymmetric signature scheme

Given a query string Q, an edit similarity search finds all strings in a database whose edit distance to Q is no more than a given threshold t. Most existing methods for answering edit similarity queries rely on a signature scheme to generate candidates for the query string. We observe that the number of signatures generated by existing methods is far greater than the lower bound, which results in high query time and index space complexities. In this paper, we show that the lower bound on the minimum signature size is t + 1. We then propose asymmetric signature schemes that achieve this lower bound, and we develop efficient query processing algorithms based on the new scheme. Several dynamic programming-based candidate pruning methods are also developed to further speed up query processing. We have conducted a comprehensive experimental study involving nine state-of-the-art algorithms. The experimental results clearly demonstrate the efficiency of our methods.
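To make the problem definition concrete, the following is a minimal brute-force sketch of an edit similarity search: it verifies every database string against Q with a thresholded dynamic program. It is a baseline for illustration only, not the asymmetric signature scheme or the pruning methods proposed in the paper; the function names `within_edit_distance` and `edit_similarity_search` and the sample data are assumptions introduced here.

```python
from typing import List


def within_edit_distance(a: str, b: str, t: int) -> bool:
    """Return True if the edit distance between a and b is at most t."""
    # Length filter: each edit changes the length by at most one.
    if abs(len(a) - len(b)) > t:
        return False
    # prev[j] holds the edit distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete a[i-1]
                         cur[j - 1] + 1,      # insert b[j-1]
                         prev[j - 1] + cost)  # match or substitute
        # Row minima never decrease, so we can stop once a row exceeds t.
        if min(cur) > t:
            return False
        prev = cur
    return prev[len(b)] <= t


def edit_similarity_search(database: List[str], query: str, t: int) -> List[str]:
    """Hypothetical baseline: scan the whole database and verify each string."""
    return [s for s in database if within_edit_distance(s, query, t)]


if __name__ == "__main__":
    db = ["kitten", "sitten", "sitting", "mitten", "bitten"]
    print(edit_similarity_search(db, "kitten", 1))
```

A signature-based method would replace the full scan with an index lookup that produces a small candidate set, and would run a verification step like `within_edit_distance` only on those candidates.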
