Compressed Suffix Trees for Repetitive Texts

We design a new compressed suffix tree tailored to highly repetitive text collections. It is particularly useful for sequence analysis on large collections of genomes of closely related species. We build on an existing compressed suffix tree that applies statistical compression and modify it to operate on a grammar-compressed version of the longest common prefix (LCP) array, whose differential version inherits much of the repetitiveness of the text.
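To make the differential LCP array concrete, here is a minimal Python sketch that builds it for a small string. It is an illustration only, not the construction used in the paper: the function names (`suffix_array`, `lcp_array`, `differential`) are hypothetical, the suffix array is built by naive sorting, and the grammar-compression step is not shown.

```python
def suffix_array(text):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    """lcp[k] is the longest common prefix length of the suffixes
    starting at sa[k-1] and sa[k]; lcp[0] is defined as 0."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        # Extend the match character by character (quadratic, for clarity).
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def differential(values):
    """d[0] = values[0]; d[k] = values[k] - values[k-1] for k > 0."""
    return [v - (values[k - 1] if k else 0) for k, v in enumerate(values)]


text = "banana"
sa = suffix_array(text)        # [5, 3, 1, 0, 4, 2]
lcp = lcp_array(text, sa)      # [0, 1, 3, 0, 0, 2]
dlcp = differential(lcp)       # [0, 1, 2, -3, 0, 2]
print(sa, lcp, dlcp)
```

On a collection of near-identical sequences, long stretches of the differential array tend to repeat, which is the property a grammar compressor can exploit.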
