Closing in on Time and Space Optimal Construction of Compressed Indexes

Fast and space-efficient construction of compressed indexes such as the compressed suffix array (CSA) and the compressed suffix tree (CST) was a major open problem until recently, when Belazzougui [STOC 2014] described an algorithm that builds both data structures in $O(n)$ time (randomized; later made deterministic by the same author) and $O(n/\log_{\sigma}n)$ words of space, where $n$ is the length of the string and $\sigma$ is the alphabet size. Shortly after, Munro et al. [SODA 2017] described another deterministic construction with the same time and space bounds, based on different techniques. Since then it has remained an elusive open problem whether these bounds are optimal or whether, assuming a non-wasteful text encoding, a construction achieving $O(n / \log_{\sigma}n)$ time and space is possible. In this paper we provide the first algorithm that can achieve these bounds. We show a deterministic algorithm that constructs the CSA and CST using $O(n / \log_{\sigma} n + r \log^{11} n)$ time and $O(n / \log_{\sigma} n + r \log^{10} n)$ working space, where $r$ is the number of runs in the Burrows-Wheeler transform of the input text. As one application of our techniques, we show how to compute the LZ77 parsing in $O(n/\log_{\sigma}n + r\log^{11}n+z\log^{10}n)$ time and $O(n/\log_{\sigma}n + r\log^{9}n)$ space, where $z$ is the number of phrases in the parsing; this is optimal for highly repetitive strings.
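To make the parameters $r$ and $z$ concrete, the following is a minimal sketch that computes both for a small string by brute force. It is not the construction described in the paper: the function names (`bwt`, `count_runs`, `lz77_phrase_count`) are hypothetical, the sentinel `$` is assumed not to occur in the text, and the LZ77 variant used here (each phrase is the longest previous occurrence, overlap allowed, or a single new character) is one common definition assumed for illustration.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform: last column of the sorted rotations.

    Assumes `sentinel` does not occur in `text` and sorts before every
    character of `text`. Quadratic space; for illustration only.
    """
    assert sentinel not in text
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters in s (the parameter r when s is a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def lz77_phrase_count(text: str) -> int:
    """Number of phrases z in a greedy LZ77 parsing (assumed variant).

    Each phrase is the longest prefix of the remaining text that also
    starts at an earlier position (self-overlap allowed), or a single
    character if no such prefix exists. Brute force; for illustration only.
    """
    n = len(text)
    i = 0
    z = 0
    while i < n:
        longest = 0
        for j in range(i):  # candidate earlier starting positions
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            longest = max(longest, length)
        i += max(longest, 1)  # a fresh character forms a phrase of length 1
        z += 1
    return z


if __name__ == "__main__":
    for t in ["banana", "ab" * 8]:
        print(t, "r =", count_runs(bwt(t)), "z =", lz77_phrase_count(t))
```

On a highly repetitive input such as `"ab" * 8`, both $r$ and $z$ stay small while $n$ grows, which is the regime in which the $r \log^{11} n$ and $z \log^{10} n$ terms in the bounds above are dominated by $n/\log_{\sigma} n$.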
