A Self-index on Block Trees

The Block Tree is a recently proposed data structure that reaches compression close to Lempel-Ziv while supporting efficient direct access to text substrings. In this paper we show how a self-index can be built on top of a Block Tree so that it provides efficient pattern searches while using space proportional to that of the original data structure. More precisely, if a Lempel-Ziv parse cuts a text of length $n$ into $z$ non-overlapping phrases, then our index uses $O(z\log(n/z))$ words and finds the $occ$ occurrences of a pattern of length $m$ in time $O(m\log n+occ\log^\epsilon n)$ for any constant $\epsilon>0$.
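To make the parameter $z$ concrete, the following minimal sketch computes a greedy Lempel-Ziv parse of a short string and counts its phrases. It rests on assumptions not fixed by the text above: each phrase is taken to be the longest prefix of the remaining text that occurs entirely before the phrase starts, extended by one fresh character, and the function name `lz_parse` is hypothetical. The quadratic-time search is for illustration only and is not the construction used by the index.

```python
def lz_parse(text):
    """Greedy LZ77-style parse (illustrative variant, see assumptions above).

    Each phrase is the longest prefix of text[i:] that occurs entirely
    inside text[:i] (so source and phrase do not overlap), plus one
    extra character when any text remains.
    """
    phrases = []
    i, n = 0, len(text)
    while i < n:
        length = 0
        # Extend the match while text[i:i+length+1] occurs within text[:i].
        # str.find(sub, 0, i) only reports occurrences that end before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i) != -1:
            length += 1
        # Add one fresh character, unless the match already reached the end.
        end = min(i + length + 1, n)
        phrases.append(text[i:end])
        i = end
    return phrases


phrases = lz_parse("abababab")
z = len(phrases)  # number of phrases in the parse
```

Under these assumptions, `"abababab"` ($n = 8$) is cut into the phrases `a`, `b`, `aba`, `bab`, so $z = 4$. The more repetitive the text, the smaller $z$ is relative to $n$, which is the regime where a space bound of $O(z\log(n/z))$ words pays off.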
