HOLZ: High-Order Entropy Encoding of Lempel-Ziv Factor Distances

We propose a new representation of the offsets of the Lempel–Ziv (LZ) factorization based on the co-lexicographic order of the text’s prefixes. The selected offsets tend to approach the k-th order empirical entropy. Our evaluations show that this choice is superior to the rightmost and bit-optimal LZ parsings on datasets with small high-order entropy.
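The two notions the abstract relies on, the k-th order empirical entropy and the co-lexicographic order of a text's prefixes, can be made concrete with a small sketch. This is not the HOLZ encoder. It only uses the standard textbook definitions, and the function names (`h0`, `hk`, `colex_prefix_order`) are hypothetical names chosen for illustration. The prefix sort is deliberately naive and only suitable for short strings.

```python
from collections import Counter, defaultdict
from math import log2


def h0(symbols):
    """Zeroth-order empirical entropy of a sequence, in bits per symbol."""
    n = len(symbols)
    if n == 0:
        return 0.0
    return sum(c / n * log2(n / c) for c in Counter(symbols).values())


def hk(text, k):
    """k-th order empirical entropy of text, in bits per symbol.

    Each symbol is grouped by the k symbols that precede it; the result is
    the size-weighted sum of the groups' zeroth-order entropies, divided by
    the text length. The first k symbols have no full context and are skipped.
    """
    n = len(text)
    if n == 0:
        return 0.0
    followers = defaultdict(list)
    for i in range(k, n):
        followers[text[i - k:i]].append(text[i])
    return sum(len(f) * h0(f) for f in followers.values()) / n


def colex_prefix_order(text):
    """Prefix lengths 1..n sorted co-lexicographically.

    Co-lexicographic order compares strings from right to left, so each
    prefix is reversed before an ordinary lexicographic sort (naive version).
    """
    return sorted(range(1, len(text) + 1), key=lambda i: text[:i][::-1])
```

For example, `hk("abababab", 0)` is 1 bit per symbol, while `hk("abababab", 1)` is 0 because each symbol is fully determined by its predecessor. Texts of this kind, with small high-order entropy, are the ones on which the abstract reports the proposed offsets doing best.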
