Compressed Suffix Arrays and Suffix Trees with Applications to Text Indexing and String Matching

The proliferation of online text, such as that found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text $T$ consisting of $n$ symbols drawn from a fixed alphabet $\Sigma$. The text $T$ can be represented in $n \lg |\Sigma|$ bits by encoding each symbol with $\lg |\Sigma|$ bits. The goal is to support fast online queries for searching any string pattern $P$ of $m$ symbols, with $T$ being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require $\Omega(n \lg n)$ additional bits of space in the worst case. For example, in the standard unit cost RAM, suffix trees and suffix arrays need $\Omega(n)$ memory words, each of $\Omega(\lg n)$ bits. These indexes are larger than the text itself by a multiplicative factor of $\Omega(\smash{\lg_{|\Sigma|} n})$, which is significant when $\Sigma$ is of constant size, such as in \textsc{ascii} or \textsc{unicode}. On the other hand, these indexes support fast searching, either in $O(m \lg |\Sigma|)$ time or in $O(m + \lg n)$ time, plus an output-sensitive cost $O(\mathit{occ})$ for listing the $\mathit{occ}$ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast $\smash{O(m /\lg_{|\Sigma|} n + \lg_{|\Sigma|}^\epsilon n)}$ search time in the worst case, for any constant $0 < \epsilon \leq 1$, using at most $\smash{\bigl(\epsilon^{-1} + O(1)\bigr) \, n \lg |\Sigma|}$ bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB \textsc{ascii} file can require 30--40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve \emph{both} time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving $O(\mathit{occ} \, \smash{\lg_{|\Sigma|}^\epsilon n})$ time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in $O(n \lg |\Sigma|)$ bits to obtain a total search bound of $O(m /\lg_{|\Sigma|} n + \mathit{occ})$ time, which is optimal.
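
To make the baseline concrete, the following is a minimal sketch of the classical, uncompressed suffix array that the abstract compares against. It is not the compressed index proposed here. The function names `build_suffix_array` and `find_occurrences` are illustrative assumptions. The construction is deliberately naive, and the search is a plain binary search costing $O(m \lg n)$ character comparisons; the $O(m + \lg n)$ bound mentioned above needs additional longest-common-prefix information that this sketch omits.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of `text` in lexicographic order.

    Naive construction by sorting whole suffixes; for illustration only.
    The result stores n integers, i.e. on the order of n lg n bits.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return the sorted positions where `pattern` occurs in `text`.

    Uses two binary searches over the suffix array `sa`, comparing the
    pattern against the first len(pattern) symbols of each suffix.
    """
    m = len(pattern)

    # First suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose m-symbol prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with the pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    T = "abracadabra"
    SA = build_suffix_array(T)
    print(SA)
    print(find_occurrences(T, SA, "abra"))
```

All occurrences of a pattern occupy one contiguous range of the suffix array, which is why two binary searches suffice to locate them. Each array entry is a full text position, so the array itself takes about $n \lg n$ bits next to the $n \lg |\Sigma|$ bits of the text; that gap is what a compressed representation aims to close.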
