A Practical q-Gram Index for Text Retrieval Allowing Errors

We propose an indexing technique for approximate text searching that is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it can retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces, which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks can be used instead of exact positions, at the price of higher querying costs. We design an algorithm that optimizes the partition of the pattern into pieces so that the total number of verifications is minimized. This is especially well suited to natural language texts, and makes it possible to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the building time, space requirements and querying time of our index, finding that it is a practical alternative for text retrieval. Retrieval times are reduced to between 10% and 60% of those of the best on-line algorithm.
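The index-then-verify scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`build_qgram_index`, `split_pattern`, `match_ends`, `approximate_search`) are hypothetical, the pattern is simply cut into k + 1 near-equal pieces instead of using the optimized partition, every piece is assumed to have at least q characters, and exact positions rather than block pointers are stored. It relies on the standard observation that if a pattern matches with at most k errors, at least one of k + 1 disjoint pieces must occur unaltered.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every substring of length q to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def split_pattern(pattern, pieces):
    """Cut the pattern into `pieces` contiguous parts of near-equal length.

    Returns (offset, piece) pairs. This uniform split is an assumption of the
    sketch; the optimized partition from the abstract is not reproduced here.
    """
    base, extra = divmod(len(pattern), pieces)
    parts, start = [], 0
    for j in range(pieces):
        length = base + (1 if j < extra else 0)
        parts.append((start, pattern[start:start + length]))
        start += length
    return parts


def match_ends(pattern, window, k):
    """Dynamic programming with a free starting point: return the exclusive end
    offsets in `window` where the pattern matches with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))  # col[i] = errors to match pattern[:i]
    ends = []
    for j, c in enumerate(window):
        prev_diag = col[0]  # col[0] stays 0: a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(prev_diag + (pattern[i - 1] != c),  # match or substitute
                         old + 1,                            # extra text character
                         col[i - 1] + 1)                     # missing text character
            prev_diag = old
        if col[m] <= k:
            ends.append(j + 1)
    return ends


def approximate_search(text, index, q, pattern, k):
    """Return sorted exclusive end positions of matches with at most k errors."""
    m = len(pattern)
    parts = split_pattern(pattern, k + 1)
    if any(len(piece) < q for _, piece in parts):
        raise ValueError("this sketch assumes every piece has at least q characters")
    found = set()
    for offset, piece in parts:
        for pos in index.get(piece[:q], ()):
            if text[pos:pos + len(piece)] != piece:
                continue  # the q-gram matched but the rest of the piece did not
            # Verify only the text area that could contain a full occurrence.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            for end in match_ends(pattern, text[lo:hi], k):
                found.add(lo + end)
    return sorted(found)


text = "the quick brown fox jumps over the lazy dog"
index = build_qgram_index(text, 3)
print(approximate_search(text, index, 3, "brown fax", 1))
```

Storing block numbers instead of exact positions would shrink `index` but force verification over whole blocks, which is the space/time trade-off the abstract mentions.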
