Practical Compressed Suffix Trees

The suffix tree is an extremely important data structure in bioinformatics. Classical implementations require so much space that they are impractical for large sequence collections. Recent research has produced various compressed representations of suffix trees, with widely different space-time tradeoffs. In this paper we show how range min-max trees yield novel representations that achieve practical space-time tradeoffs. In addition, we show how those trees can be modified to index highly repetitive collections, obtaining the first compressed suffix tree representation that effectively adapts to that scenario.
