Optimal Exact String Matching Based on Suffix Arrays

Using the suffix tree of a string S, decision queries of the type "Is P a substring of S?" can be answered in O(|P|) time, and enumeration queries of the type "Where are all z occurrences of P in S?" can be answered in O(|P| + z) time, independent of the size of S. However, in large-scale applications such as genome analysis, the space requirements of the suffix tree are a severe drawback. The suffix array is a more space-efficient index structure. Using it and an additional table, Manber and Myers (1993) showed that decision queries and enumeration queries can be answered in O(|P| + log |S|) and O(|P| + log |S| + z) time, respectively, but no optimal-time algorithms are known. In this paper, we show how to achieve the optimal O(|P|) and O(|P| + z) time bounds for the suffix array. Our approach is not confined to exact pattern matching. In fact, it can be used to efficiently solve all problems that are usually solved by a top-down traversal of the suffix tree. Experiments show that our method is not only of theoretical interest but also of practical relevance.
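To make the two query types concrete, the following is a minimal sketch of the plain binary-search baseline on a suffix array. It is neither the Manber and Myers method with its additional table nor the optimal method of this paper: it runs in O(|P| log |S|) per decision query plus the output size for enumeration. The function names build_suffix_array and find_occurrences are illustrative assumptions, and the construction shown is a naive sort used only to keep the example short.

```python
def build_suffix_array(s):
    # Naive construction: sort suffix start positions by the suffix text.
    # Illustration only; practical constructions are far more efficient.
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, p):
    """Return the sorted start positions of pattern p in s.

    Plain binary search over the suffix array sa: each of the
    O(log |S|) steps compares at most |P| characters.
    """
    m = len(p)

    # Left boundary: first suffix whose length-m prefix is >= p.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] < p:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Right boundary: first suffix whose length-m prefix is > p.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:sa[mid] + m] <= p:
            lo = mid + 1
        else:
            hi = mid

    # All suffixes in sa[start:lo] begin with p.
    return sorted(sa[start:lo])


def is_substring(s, sa, p):
    # Decision query: is p a substring of s?
    return len(find_occurrences(s, sa, p)) > 0
```

With s = "banana" and sa = build_suffix_array(s), the enumeration query find_occurrences(s, sa, "ana") yields the start positions 1 and 3, and the decision query is_substring(s, sa, "x") is false.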

[1] 김동규, et al. [Book review] "Algorithms on Strings, Trees, and Sequences", 2000.

[2] Robert Sedgewick, et al. Fast algorithms for sorting and searching strings, 1997, SODA '97.

[3] Dan Gusfield, et al. Algorithms on Strings, Trees, and Sequences - Computer Science and Computational Biology, 1997.

[4] Giovanni Manzini, et al. An experimental study of an opportunistic index, 2001, SODA '01.

[5] Eugene W. Myers, et al. Suffix arrays: a new method for on-line string searches, 1993, SODA '90.

[6] Stefan Kurtz, et al. Reducing the space requirement of suffix trees, 1999.

[7] Roberto Grossi, et al. Compressed suffix arrays and suffix trees with applications to text indexing and string matching (extended abstract), 2000, STOC '00.

[8] Giovanni Manzini, et al. Opportunistic data structures with applications, 2000, Proceedings 41st Annual Symposium on Foundations of Computer Science.

[9] Hiroki Arimura, et al. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, 2001, CPM.

[10] Peter Weiner, et al. Linear Pattern Matching Algorithms, 1973, SWAT.

[11] Gaston H. Gonnet, et al. New Indices for Text: Pat Trees and Pat Arrays, 1992, Information Retrieval: Data Structures & Algorithms.

[12] Alberto Apostolico, et al. The Myriad Virtues of Subword Trees, 1985.

[13] Enno Ohlebusch, et al. The Enhanced Suffix Array and Its Applications to Genome Analysis, 2002, WABI.

[14] Hiroki Arimura, et al. Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications, 2001.