Faster entropy-bounded compressed suffix trees

Suffix trees are among the most important data structures in stringology, with a number of applications in flourishing areas such as bioinformatics. Their main problem is space usage, which has triggered much research into compressed representations that remain functional. A smaller suffix tree representation could fit in a faster memory, outweighing by far the theoretical slowdown brought by the space reduction. We present a novel compressed suffix tree, the first to achieve at the same time sublogarithmic complexity for the operations and space usage that asymptotically goes to zero as the entropy of the text does. The main ideas in our development are compressing the longest common prefix information, discarding the suffix tree topology entirely, and expressing all the suffix tree operations using range minimum queries and a novel primitive called next/previous smaller value in a sequence. Our solutions to those operations are of independent interest.
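
To make the role of these primitives concrete, the following minimal Python sketch shows how a suffix tree node, represented only as an interval [l, r] of the suffix array, can be navigated with range minimum queries (RMQ) and previous/next smaller value (PSV/NSV) queries over the longest common prefix (LCP) array. It is an illustration under stated assumptions, not the data structure of the paper: the three primitives are naive linear scans rather than compressed sublogarithmic structures, the function names are hypothetical, the LCP array is padded with -1 sentinels at both ends, and the parent rule is the usual interval-based formulation, which the abstract itself does not spell out.

```python
def psv(lcp, i):
    """Previous smaller value: largest j < i with lcp[j] < lcp[i], or -1 if none."""
    for j in range(i - 1, -1, -1):
        if lcp[j] < lcp[i]:
            return j
    return -1


def nsv(lcp, i):
    """Next smaller value: smallest j > i with lcp[j] < lcp[i], or len(lcp) if none."""
    for j in range(i + 1, len(lcp)):
        if lcp[j] < lcp[i]:
            return j
    return len(lcp)


def rmq(lcp, i, j):
    """Range minimum query: position of the leftmost minimum in lcp[i..j]."""
    return min(range(i, j + 1), key=lcp.__getitem__)


def string_depth(lcp, l, r):
    """String depth of the internal node with suffix array interval [l, r], l < r."""
    return lcp[rmq(lcp, l + 1, r)]


def parent(lcp, l, r):
    """Interval of the parent of node [l, r], or None for the root."""
    n = len(lcp) - 1  # number of suffixes; lcp[n] is the right sentinel
    if l == 0 and r == n - 1:
        return None
    # Pick the boundary with the larger LCP value; it gives the parent's depth.
    k = l if lcp[l] > lcp[r + 1] else r + 1
    return psv(lcp, k), nsv(lcp, k) - 1


# Assumed example: text "banana$", sorted suffixes
# $, a$, ana$, anana$, banana$, na$, nana$.
# lcp[i] is the LCP of suffixes i-1 and i in that order; lcp[0] = lcp[7] = -1.
LCP = [-1, 0, 1, 3, 0, 0, 2, -1]

print(parent(LCP, 2, 3))        # node "ana" -> (1, 3), the node "a"
print(parent(LCP, 1, 3))        # node "a"   -> (0, 6), the root
print(string_depth(LCP, 2, 3))  # -> 3
```

Because every operation reads only the LCP array, no explicit tree topology needs to be stored in this sketch; replacing the linear scans with succinct RMQ and PSV/NSV structures is where the actual space and time bounds would come from.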
