Space-Constrained Gram-Based Indexing for Efficient Approximate String Search

Answering approximate queries on string collections is important in applications such as data cleaning, query relaxation, and spell checking, where inconsistencies and errors exist in user queries as well as data. Many existing algorithms use gram-based inverted-list indexing structures to answer approximate string queries. These indexing structures are "notoriously" large compared to the size of their original string collection. In this paper, we study how to reduce the size of such an indexing structure to a given amount of space, while retaining efficient query processing. We first study how to adopt existing inverted-list compression techniques to solve our problem. Then, we propose two novel approaches for achieving the goal: one is based on discarding gram lists, and one is based on combining correlated lists. They are both orthogonal to existing compression techniques, exploit a unique property of our setting, and offer new opportunities for improving query performance. For each approach we analyze its effect on query performance and develop algorithms for wisely choosing lists to discard or combine. Our extensive experiments on real data sets show that our approaches provide applications the flexibility in deciding the tradeoff between query performance and indexing size, and can outperform existing compression techniques. An interesting and surprising finding is that while we can reduce the index size significantly (up to 60% reduction) with tolerable performance penalties, for 20-40% reductions we can even improve query performance compared to original indexes.
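
To make the setting concrete, the following is a minimal Python sketch of a gram-based inverted index queried under an edit-distance threshold, with and without discarded gram lists. It is an illustration under stated assumptions, not the algorithms of this paper: the function names (`grams`, `build_index`, `search`), the unpadded 2-grams, and the rule that lowers the count-filter threshold by one for every query gram whose list was discarded are all assumptions made for the example. The sketch does not show how to choose which lists to discard, and it does not cover combining correlated lists.

```python
from collections import Counter, defaultdict


def grams(s, q=2):
    """Return the overlapping q-grams of s, in order, without padding."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q=2):
    """Map each gram to the list of ids of the strings that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(strings):
        for g in set(grams(s, q)):
            index[g].append(sid)
    return dict(index)


def edit_distance(a, b):
    """Standard dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def search(strings, index, query, k, q=2, discarded=frozenset()):
    """Return ids of strings within edit distance k of query.

    Count filter: one edit operation destroys at most q grams, so an answer
    must match at least len(query_grams) - k * q of the query's grams.
    Assumption for this sketch: every query gram whose list was discarded
    lowers that threshold by one, which keeps the filter free of false
    negatives at the cost of more candidates to verify.
    """
    query_grams = grams(query, q)
    holes = sum(1 for g in query_grams if g in discarded)
    threshold = len(query_grams) - k * q - holes
    if threshold <= 0:
        # The filter cannot prune anything: verify every string.
        candidates = range(len(strings))
    else:
        counts = Counter()
        for g in query_grams:
            if g not in discarded:
                counts.update(index.get(g, ()))
        candidates = [sid for sid, c in counts.items() if c >= threshold]
    return sorted(sid for sid in candidates
                  if edit_distance(strings[sid], query) <= k)


def index_size(index):
    """Total number of string ids stored across all inverted lists."""
    return sum(len(ids) for ids in index.values())


strings = ["approximate", "appropriate", "proximity", "string", "strong"]
full = build_index(strings)

# Discard two gram lists (chosen by hand here) to shrink the index.
discarded = frozenset({"ap", "pr"})
pruned = {g: ids for g, ids in full.items() if g not in discarded}

print(index_size(full), index_size(pruned))
print(search(strings, full, "aproximate", k=1))
print(search(strings, pruned, "aproximate", k=1, discarded=discarded))
```

Both searches return the same answers because every candidate is verified with `edit_distance`; discarding lists only changes how many candidates survive the filter and how many lists are scanned. That trade-off between list-scanning cost and verification cost is one plausible reading of why a smaller index need not be slower.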
