Improved Approximate String Matching Using Compressed Suffix Data Structures

Abstract. Approximate string matching is the problem of finding occurrences of a given pattern in a text while allowing a bounded number of errors. In this paper we present a space-efficient data structure for the 1-mismatch and 1-difference problems. Given a text $T$ of length $n$ over an alphabet $A$, we can preprocess $T$ into an $O(n\sqrt{\log n}\log |A|)$-bit data structure so that, for any query pattern $P$ of length $m$, all 1-mismatch (or 1-difference) occurrences of $P$ can be found in $O(|A| m \log\log n + occ)$ time, where $occ$ is the number of occurrences. This is the fastest known query time among data structures using $o(n\log^2 n)$ bits of space. The space can be further reduced to $O(n\log |A|)$ bits, with the query time increasing by a factor of $\log^{\epsilon} n$, for $0 < \epsilon \le 1$. Furthermore, our solution generalizes to the $k$-mismatch (and $k$-difference) problem, with query times $O(|A|^k m^k (k + \log\log n) + occ)$ and $O(\log^{\epsilon} n\,(|A|^k m^k (k + \log\log n) + occ))$ using an $O(n\sqrt{\log n}\log |A|)$-bit and an $O(n\log |A|)$-bit indexing data structure, respectively. For the $O(n\sqrt{\log n}\log |A|)$-bit data structure, we assume that the alphabet size $|A|$ is bounded by $O(2^{\sqrt{\log n}})$.
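
To make the 1-mismatch problem concrete, the following is a minimal sketch, not the data structure described in the abstract. It assumes a plain, uncompressed suffix array built naively, and it treats a "1-mismatch occurrence" as a position where the pattern matches with Hamming distance at most 1 (so exact matches are included). The function names `build_suffix_array`, `exact_occurrences`, and `one_mismatch_occurrences` are hypothetical. The sketch enumerates every single-character substitution of the pattern and runs one exact search per variant, which is one plausible reading of where an $|A| m$ factor in a query bound can come from; the compressed structures and the $\log\log n$ search step are not modeled here.

```python
def build_suffix_array(text):
    # Naive construction by sorting all suffixes; fine for a small illustration,
    # far from the compressed structures discussed in the abstract.
    return sorted(range(len(text)), key=lambda i: text[i:])


def exact_occurrences(text, sa, pattern):
    # Binary search for the block of suffixes whose first m characters equal
    # the pattern. Truncated prefixes are non-decreasing in suffix-array order.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # lower bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # upper bound
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sa[start:lo]


def one_mismatch_occurrences(text, pattern, alphabet):
    # Assumption: occurrences with Hamming distance <= 1 are reported.
    sa = build_suffix_array(text)
    hits = set(exact_occurrences(text, sa, pattern))
    for i, original in enumerate(pattern):
        for c in alphabet:
            if c == original:
                continue
            # Substitute one character and search for the variant exactly.
            variant = pattern[:i] + c + pattern[i + 1:]
            hits.update(exact_occurrences(text, sa, variant))
    return sorted(hits)


if __name__ == "__main__":
    print(one_mismatch_occurrences("banana", "nana", "abn"))
```

In this example the pattern `nana` occurs exactly at position 2 of `banana`, and `bana` at position 0 differs from it in one character, so both positions are reported. The 1-difference variant would additionally enumerate single insertions and deletions.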
