Faster Compressed Suffix Trees for Repetitive Collections

Recent compressed suffix trees designed for highly repetitive sequence collections achieve excellent compression, but their operation times are very high. We design a new suffix tree representation for this scenario that still uses very little space, only slightly more than the best previous one, yet supports the operations orders of magnitude faster. Our suffix tree is still orders of magnitude slower than general-purpose compressed suffix trees, but those use several times more space when the collection is repetitive. Our main novelty is a practical grammar-compressed tree representation with full navigation functionality, which is useful in any application where large trees with repetitive topology must be represented.
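To make the idea of a grammar-compressed tree more concrete, the following is a minimal sketch, not the representation described in the paper. It assumes the tree topology is written as a balanced-parentheses string and applies a simple Re-Pair-style pairing to it; the function names `repair`, `expand`, and `summarize` are hypothetical. The per-rule summaries (expanded length and parenthesis excess) illustrate the kind of information that could let navigation skip over whole nonterminals instead of decompressing them.

```python
from collections import Counter

def repair(seq):
    """Re-Pair-style sketch: repeatedly replace the most frequent adjacent pair."""
    seq = list(seq)
    rules = {}      # nonterminal name -> (left symbol, right symbol)
    next_id = 0
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:            # no pair repeats, so nothing more to gain
            break
        name = f"R{next_id}"
        next_id += 1
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):     # left-to-right, non-overlapping replacement
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    """Recursively expand a symbol back into parentheses."""
    if symbol not in rules:
        return symbol
    left, right = rules[symbol]
    return expand(left, rules) + expand(right, rules)

def summarize(rules):
    """Per-symbol (expanded length, excess), where excess = opens minus closes."""
    info = {"(": (1, 1), ")": (1, -1)}
    # Rules are created bottom-up, so both children are always summarized first.
    for name, (left, right) in rules.items():
        info[name] = (info[left][0] + info[right][0],
                      info[left][1] + info[right][1])
    return info

# A root with eight identical subtrees: a highly repetitive topology.
bp = "(" + "(()())" * 8 + ")"
seq, rules = repair(bp)
info = summarize(rules)
print(len(bp), "parentheses ->", len(seq), "symbols,", len(rules), "rules")
```

The compressed sequence plus the rules expand back to the original parentheses, and the summaries allow the total length and excess to be computed from the compressed form alone. A practical structure would need considerably more machinery than this to support full navigation efficiently.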
