Universal Compressed Text Indexing

The rise of repetitive datasets has recently generated considerable interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family of techniques that exploit text repetitions in different ways. For each such compression scheme, several indexing solutions have been proposed over the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions such that every distinct substring has an occurrence crossing at least one of them. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.
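To make the notion of a string attractor concrete, the following is a minimal brute-force sketch that checks whether a given set of positions is an attractor for a short text. It is only an illustration of the definition, not the index described in the paper; the function name `is_string_attractor` and the use of 0-based positions are assumptions made for this example.

```python
def is_string_attractor(text, positions):
    """Return True if `positions` (0-based, hypothetical convention) is a
    string attractor of `text`: every distinct substring must have at least
    one occurrence that crosses some position in the set.

    Brute force over all substrings, so suitable only for tiny inputs.
    """
    n = len(text)
    attractor = set(positions)
    all_substrings = set()      # every distinct substring of text
    covered_substrings = set()  # those with an occurrence crossing the set

    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            all_substrings.add(sub)
            # The occurrence text[i:j] spans positions i .. j-1.
            if any(i <= p < j for p in attractor):
                covered_substrings.add(sub)

    return all_substrings == covered_substrings


if __name__ == "__main__":
    print(is_string_attractor("abab", {0, 1}))  # both letters are covered
    print(is_string_attractor("abab", {0}))     # no occurrence of "b" crosses 0
```

On a highly repetitive text such as `"aaaa"`, a single position suffices, which reflects why $\gamma$ can be much smaller than $n$ for repetitive inputs.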
