Re-pair Achieves High-Order Entropy

Re-pair is a dictionary-based compression method invented in 1999 by J. Larsson and A. Moffat [Off-line dictionary-based compression. Proc. IEEE, 88(11):1722-1732, 2000], for which no analysis of compression efficiency has been available until now. We show that Re-pair compresses a sequence T[1,n] over an alphabet of size sigma to at most 2nH_k + o(n log sigma) bits, for any k = o(log_sigma n), where H_k is either the classical information-theoretic k-th order entropy or the empirical k-th order entropy (in the latter case, the model is inferred from the statistics of the sequence itself).
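
To make the two quantities in this statement concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual description of Re-pair (repeatedly replace the most frequent pair of adjacent symbols with a fresh nonterminal until no pair occurs twice) and the standard definition of the empirical k-th order entropy (the sum, over each length-k context, of the zero-order entropy of the symbols that follow it, divided by n). The function names `repair`, `expand`, and `empirical_entropy` are hypothetical, the input is assumed to be a Python string, and the pair search is a naive quadratic scan rather than the linear-time procedure of Larsson and Moffat.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy(text, k):
    """Empirical k-th order entropy H_k of text, in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    # Group every symbol by the k symbols that precede it (its context).
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[tuple(text[i - k:i])][text[i]] += 1
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        # Zero-order entropy of this context's followers, times their number.
        total_bits -= sum(c * log2(c / m) for c in counts.values())
    return total_bits / n


def count_pairs(seq):
    """Count non-overlapping occurrences of each adjacent pair, left to right."""
    counts = Counter()
    last = {}  # pair -> start index of its last counted occurrence
    for i in range(len(seq) - 1):
        pair = (seq[i], seq[i + 1])
        if last.get(pair) == i - 1:
            continue  # overlaps the previous occurrence, e.g. inside "aaa"
        counts[pair] += 1
        last[pair] = i
    return counts


def repair(text):
    """Naive Re-pair. Returns (final sequence, rules).

    Terminals are characters; nonterminals are ints indexing into rules
    (an assumption of this sketch, so the input must be a str).
    """
    seq = list(text)
    rules = []  # rules[j] = (left, right) for nonterminal j
    while True:
        counts = count_pairs(seq)
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # no pair repeats any more
        nonterminal = len(rules)
        rules.append(pair)
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(nonterminal)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules


def expand(seq, rules):
    """Invert repair(): expand nonterminals back into the original string."""
    out = []
    stack = list(reversed(seq))
    while stack:
        symbol = stack.pop()
        if isinstance(symbol, int):
            left, right = rules[symbol]
            stack.append(right)
            stack.append(left)
        else:
            out.append(symbol)
    return "".join(out)


if __name__ == "__main__":
    sample = "abracadabra abracadabra abracadabra"
    final_seq, grammar = repair(sample)
    assert expand(final_seq, grammar) == sample
    print(len(final_seq), "symbols left,", len(grammar), "rules")
    for order in (0, 1, 2):
        print("n * H_%d =" % order, len(sample) * empirical_entropy(sample, order), "bits")
```

The sketch only produces the grammar and the entropy figures. It does not encode the rules and the final sequence into bits, so it illustrates the objects the bound refers to rather than verifying the bound itself.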
