Random Access to Grammar-Compressed Strings and Trees

Grammar-based compression, where one replaces a long string by a small context-free grammar that generates the string, is a simple and powerful paradigm that captures (sometimes with a slight reduction in efficiency) many of the popular compression schemes, including the Lempel--Ziv family, run-length encoding, byte-pair encoding, Sequitur, and Re-Pair. In this paper, we present a novel grammar representation that allows efficient random access to any character or substring without decompressing the string. Let $S$ be a string of length $N$ compressed into a context-free grammar $\mathcal{S}$ of size $n$. We present two representations of $\mathcal{S}$ achieving $O(\log N)$ random access time, and either $O(n\cdot\alpha_k(n))$ construction time and space on the pointer machine model, or $O(n)$ construction time and space on the RAM. Here, $\alpha_k(n)$ is the inverse of the $k$th row of Ackermann's function. Our representations also efficiently support decompression of any substring in $S$: we can decompres...
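
To make the random access problem concrete, the following is a minimal sketch of the naive baseline, not the representation proposed in the paper. It assumes the grammar is given as a straight-line program in which every rule is either a single terminal character or a pair of nonterminals; the names `RULES`, `build_lengths`, `access`, and `expand` are hypothetical and chosen only for illustration. Each nonterminal stores the length of its expansion, and a query walks down from the start symbol, so the cost is proportional to the height of the grammar rather than the $O(\log N)$ bound stated above.

```python
from typing import Dict, Tuple, Union

# A rule is either a terminal character or a pair (left, right) of nonterminals.
Rule = Union[str, Tuple[str, str]]

# Hypothetical example grammar: X6 generates the string "abaababa".
RULES: Dict[str, Rule] = {
    "X1": "b",
    "X2": "a",
    "X3": ("X2", "X1"),
    "X4": ("X3", "X2"),
    "X5": ("X4", "X3"),
    "X6": ("X5", "X4"),
}


def build_lengths(rules: Dict[str, Rule]) -> Dict[str, int]:
    """Compute the expansion length of every nonterminal (memoized)."""
    lengths: Dict[str, int] = {}

    def length(sym: str) -> int:
        if sym not in lengths:
            rhs = rules[sym]
            if isinstance(rhs, str):
                lengths[sym] = 1
            else:
                lengths[sym] = length(rhs[0]) + length(rhs[1])
        return lengths[sym]

    for sym in rules:
        length(sym)
    return lengths


def access(rules: Dict[str, Rule], lengths: Dict[str, int], start: str, i: int) -> str:
    """Return character i (0-based) of the string generated by `start`.

    Walks one root-to-leaf path of the parse tree without decompressing,
    so the time is proportional to the grammar height (naive baseline).
    """
    if not 0 <= i < lengths[start]:
        raise IndexError("position out of range")
    sym = start
    while True:
        rhs = rules[sym]
        if isinstance(rhs, str):
            return rhs
        left, right = rhs
        if i < lengths[left]:
            sym = left
        else:
            i -= lengths[left]
            sym = right


def expand(rules: Dict[str, Rule], sym: str) -> str:
    """Fully decompress `sym`; used here only to cross-check `access`."""
    rhs = rules[sym]
    if isinstance(rhs, str):
        return rhs
    return expand(rules, rhs[0]) + expand(rules, rhs[1])
```

A query such as `access(RULES, build_lengths(RULES), "X6", 3)` follows a single path through the grammar. For an unbalanced grammar that path can have length proportional to $n$, which is the gap the paper's representations are designed to close.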
