Linear-time String Indexing and Analysis in Small Space

The field of succinct data structures has flourished over the past 16 years. Starting from the compressed suffix array by Grossi and Vitter (STOC 2000) and the FM-index by Ferragina and Manzini (FOCS 2000), a number of generalizations and applications of string indexes based on the Burrows-Wheeler transform (BWT) have been developed, all taking an amount of space that is close to the input size in bits. In many large-scale applications, the construction of the index and its usage need to be considered as one unit of computation. For example, one can compare two genomes by building a common index for their concatenation and then detecting common substructures by querying the index. Efficient string indexing and analysis in small space also lies at the core of a number of primitives in the data-intensive field of high-throughput DNA sequencing.

We report the following advances in string indexing and analysis:

We show that the BWT of a string T ∈ {1,…,σ}^n can be built in deterministic O(n) time using just O(n log σ) bits of space, where σ ≤ n. Deterministic linear time is achieved by exploiting a new partial rank data structure that supports queries in constant time and that might have independent interest.

Within the same time and space budget, we can build an index based on the BWT that allows one to enumerate all the internal nodes of the suffix tree of T. Many fundamental string analysis problems, such as maximal repeats, maximal unique matches, and string kernels, can be mapped to such enumeration and can thus be solved in deterministic O(n) time and in O(n log σ) bits of space from the input string, by tailoring the enumeration algorithm to some problem-specific computations.

We also show how to build many of the existing indexes based on the BWT, such as the compressed suffix array, the compressed suffix tree, and the bidirectional BWT index, in randomized O(n) time and in O(n log σ) bits of space. The previously fastest construction algorithms for the BWT, the compressed suffix array, and the compressed suffix tree that used O(n log σ) bits of space took O(n log log σ) time for the first two structures and O(n log^ε n) time for the third, where ε is any positive constant smaller than one. Alternatively, the BWT could previously be built in linear time if one was willing to spend O(n log σ log log_σ n) bits of space. In contrast to the state of the art, our bidirectional BWT index supports every operation in constant time per element in its output.
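
To make the objects named above concrete, the following Python sketch builds the BWT of a short string, counts pattern occurrences by backward search, and inverts the transform. It is an illustration under simplifying assumptions, not the construction reported here: the BWT is obtained by naively sorting suffixes (far from linear time), and rank queries are answered from a plain occurrence table of O(nσ) integers rather than from a succinct structure. The function names `bwt`, `build_tables`, `count`, and `inverse_bwt` are hypothetical and chosen only for this example.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive BWT: sort all suffixes of text+sentinel and output, for each
    suffix in sorted order, the character that precedes it (cyclically).
    Assumes the sentinel is smaller than every character of text."""
    s = text + sentinel
    order = sorted(range(len(s)), key=lambda i: s[i:])
    return "".join(s[i - 1] for i in order)


def build_tables(bwt_str: str):
    """Return (C, occ), where C[c] is the number of characters smaller
    than c, and occ[c][i] is the number of occurrences of c in
    bwt_str[:i]. This plain table is NOT succinct; it only mimics the
    interface of a rank data structure."""
    counts = {}
    for ch in bwt_str:
        counts[ch] = counts.get(ch, 0) + 1
    C, total = {}, 0
    for c in sorted(counts):
        C[c] = total
        total += counts[c]
    occ = {c: [0] * (len(bwt_str) + 1) for c in counts}
    for i, ch in enumerate(bwt_str):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return C, occ


def count(pattern: str, bwt_str: str) -> int:
    """Number of occurrences of pattern, by backward search over the BWT."""
    C, occ = build_tables(bwt_str)
    lo, hi = 0, len(bwt_str)  # half-open interval of sorted suffixes
    for c in reversed(pattern):
        if c not in C:
            return 0
        lo = C[c] + occ[c][lo]
        hi = C[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


def inverse_bwt(bwt_str: str) -> str:
    """Recover the text (without sentinel) by following the LF-mapping.
    Each step only asks for the rank of the character stored at position
    i itself, which is our reading of a 'partial rank' query."""
    C, occ = build_tables(bwt_str)
    i, out = 0, []  # row 0 is the suffix made of the sentinel alone
    for _ in range(len(bwt_str) - 1):
        c = bwt_str[i]
        out.append(c)
        i = C[c] + occ[c][i]  # LF(i)
    return "".join(reversed(out))
```

For instance, `bwt("banana", "$")` yields `"annb$aa"`, on which `count("ana", ...)` finds two occurrences and `inverse_bwt` returns `"banana"`. The point of the sketch is only the interface: inversion needs nothing beyond rank queries for the character at the queried position, whereas backward search needs rank for arbitrary characters.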
