Indexing text with approximate q-grams

We present a new index for approximate string matching. The index collects text q-samples, i.e. disjoint text substrings of length q taken at fixed intervals, and stores their positions. At search time, part of the text is filtered out by observing that any occurrence of the pattern must contain some text q-samples that approximately match substrings of the pattern. We show experimentally that the parameters of this filtration scheme allow a trade-off between the space requirement of the index and the error level up to which the filtration remains efficient.
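The two ingredients named in the abstract, sampling disjoint q-grams at fixed intervals and matching a sample approximately inside the pattern, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the sampling step h, and the use of a plain dynamic-programming table are assumptions made for illustration, and the actual filtration condition that decides which text areas to verify is not reproduced here.

```python
from collections import defaultdict


def build_q_sample_index(text, q, h):
    """Map each q-sample to the text positions where it starts.

    Samples are taken every h characters; h >= q keeps them disjoint.
    (q and h are illustrative parameter names, not taken from the paper.)
    """
    if q <= 0 or h < q:
        raise ValueError("need q > 0 and h >= q so that samples are disjoint")
    index = defaultdict(list)
    for pos in range(0, len(text) - q + 1, h):
        index[text[pos:pos + q]].append(pos)
    return dict(index)


def best_match_distance(sample, pattern):
    """Smallest edit distance between sample and any substring of pattern.

    Row 0 of the table is all zeros, so a match may start at any pattern
    position; taking the minimum of the last row lets it end anywhere.
    """
    prev = [0] * (len(pattern) + 1)
    for i, s in enumerate(sample, 1):
        cur = [i]  # matching sample[:i] against an empty substring costs i
        for j, p in enumerate(pattern, 1):
            cur.append(min(prev[j - 1] + (s != p),  # substitution or match
                           prev[j] + 1,             # drop a sample character
                           cur[j - 1] + 1))         # drop a pattern character
        prev = cur
    return min(prev)


def sample_distances(index, pattern):
    """For every indexed q-sample, its best approximate match inside pattern."""
    return {sample: best_match_distance(sample, pattern) for sample in index}


if __name__ == "__main__":
    idx = build_q_sample_index("abracadabra", q=2, h=3)
    print(idx)
    print(sample_distances(idx, "cad"))
```

A larger h stores fewer samples and so shrinks the index, but it leaves less evidence per text window for the filter to work with. That is the space versus error-level trade-off the abstract describes.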
