Fully Functional Suffix Trees and Optimal Text Searching in BWT-Runs Bounded Space

Indexing highly repetitive texts, such as genomic databases, software repositories, and versioned text collections, has become an important problem since the turn of the millennium. A relevant compressibility measure for repetitive texts is r, the number of runs in their Burrows-Wheeler Transforms (BWTs). One of the earliest indexes for repetitive collections, the Run-Length FM-index, used O(r) space and was able to efficiently count the number of occurrences of a pattern of length m in a text of length n (in O(m log log n) time, with current techniques). However, it was unable to locate the positions of those occurrences efficiently within a space bounded in terms of r. In this article, we close this long-standing problem, showing how to extend the Run-Length FM-index so that it can locate the occ occurrences efficiently (in O(occ log log n) time) within O(r) space. By raising the space to O(r log log n), our index counts the occurrences in optimal time, O(m), and locates them in optimal time as well, O(m + occ). By further raising the space by an O(w/log σ) factor, where σ is the alphabet size and w = Ω(log n) is the RAM machine word size in bits, we support count and locate in O(⌈m log(σ)/w⌉) and O(⌈m log(σ)/w⌉ + occ) time, which is optimal in the packed setting and had not been obtained before in compressed space. We also describe a structure using O(r log(n/r)) space that replaces the text and extracts any text substring of length ℓ in the almost-optimal time O(log(n/r) + ℓ log(σ)/w). Within that space, we similarly provide access to arbitrary suffix array, inverse suffix array, and longest common prefix array cells in time O(log(n/r)), and extend these capabilities to full suffix tree functionality, typically in O(log(n/r)) time per operation. Our experiments show that our O(r)-space index outperforms the space-competitive alternatives by 1–2 orders of magnitude in time. Competitive implementations of the original FM-index are outperformed by 1–2 orders of magnitude in space and/or 2–3 in time.
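
To make the measure r and the counting operation concrete, the following is a minimal Python sketch, not the index described in the article. It builds the BWT by naive suffix sorting, counts its runs, and counts pattern occurrences with the classical FM-index backward search over the plain (uncompressed) BWT. The function names `bwt`, `count_runs`, and `count_occurrences` are illustrative assumptions, as is the use of `$` as a terminator smaller than every text symbol; the rank computation is a linear scan, so none of the time or space bounds stated above apply to this code.

```python
from collections import Counter


def bwt(text):
    """Return the BWT of text + '$' using naive suffix sorting.

    Illustration only: assumes '$' does not occur in text and sorts
    before every text symbol. Not a space- or time-efficient construction.
    """
    s = text + "$"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT symbol for suffix i is the symbol preceding it (cyclically).
    return "".join(s[i - 1] for i in suffix_array)


def count_runs(b):
    """Return r, the number of maximal runs of equal symbols in b."""
    return sum(1 for i in range(len(b)) if i == 0 or b[i] != b[i - 1])


def count_occurrences(b, pattern):
    """Count occurrences of pattern via backward search on the BWT b.

    Rank is computed by scanning a prefix of b, which is slow; a real
    index would answer rank queries on a run-length encoded BWT instead.
    """
    # first[c] = number of symbols in b that are strictly smaller than c.
    counts = Counter(b)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    lo, hi = 0, len(b)  # current suffix array interval [lo, hi)
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + b[:lo].count(c)
        hi = first[c] + b[:hi].count(c)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    b = bwt("banana")
    print(b, count_runs(b), count_occurrences(b, "ana"))
```

On a highly repetitive input such as `"ab" * 50`, the BWT of this sketch consists of a run of `b`, the terminator, and a run of `a`, so r stays at 3 while n grows. This is the kind of gap between r and n that run-length indexes exploit.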
