Compressing Huffman Models on Large Alphabets

Storing a Huffman model naively for a text of length n over an alphabet of size σ requires O(σ log n) bits. Canonical codes reduce this to σ log σ + O(σ) bits. This overhead over the entropy can be significant when σ is comparable to n, and it also dictates the amount of main memory required to compress or decompress. We design an encoding scheme that requires σ log log n + O(σ + log² n) bits in the worst case, and typically less, while supporting encoding and decoding of symbols in O(log log n) time. We show that, on various real-life sequences over large alphabets, our technique reduces the model to around 15% of the storage size required by state-of-the-art techniques, while still offering reasonable compression and decompression times.
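
As background for the canonical-code bound mentioned above, the following minimal Python sketch shows why a canonical code lets the model store only code lengths rather than an explicit tree or codeword table: the codewords can be regenerated from the lengths alone. The function names `code_lengths` and `canonical_codes` are illustrative assumptions, not taken from the paper, and the sketch does not implement the σ log log n encoding scheme proposed here.

```python
import heapq
from collections import Counter


def code_lengths(freqs):
    """Return {symbol: Huffman code length} for a frequency table."""
    if len(freqs) == 1:
        # A single-symbol alphabet still needs one bit per symbol.
        return {s: 1 for s in freqs}
    # Heap entries: (weight, unique tiebreak, symbols in this subtree).
    heap = [(w, i, [s]) for i, (s, w) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tiebreak, s1 + s2))
        tiebreak += 1
    return lengths


def canonical_codes(lengths):
    """Assign canonical codewords using only the code lengths."""
    codes = {}
    code = 0
    prev_len = 0
    # Visit symbols by increasing length, breaking ties by symbol order.
    for sym, ln in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= ln - prev_len  # pad with zeros when the length grows
        codes[sym] = format(code, f"0{ln}b")
        code += 1
        prev_len = ln
    return codes


lengths = code_lengths(Counter("abracadabra"))
codes = canonical_codes(lengths)
```

In this sketch the decoder needs only `lengths` to rebuild `codes`, which is the reason the model cost drops from one codeword per symbol to one length per symbol plus the symbol ordering.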
