Improving Semistatic Compression Via Pair-Based Coding

In recent years, semistatic word-based byte-oriented compressors, such as Plain Huffman, Tagged Huffman, and the Dense Codes, have been used to improve the efficiency of text retrieval systems while reducing compressed collections to 30-35% of their original size. In this paper, we present a new semistatic compressor called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27-28% of their original size, outperforming the optimal 0-order prefix-free semistatic compressor (Plain Huffman) by more than 3 percentage points. Moreover, PETDC also permits random decompression and direct searches on the compressed text using fast Boyer-Moore algorithms. PETDC builds a vocabulary containing both words and pairs of words. The basic idea behind PETDC is that, since each symbol in the vocabulary is given a codeword, compression improves when two words of the source text are replaced by a single codeword.
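The pair-based idea can be illustrated with a minimal sketch. The abstract does not say how PETDC chooses which pairs enter the vocabulary, so the selection rule below (keep every adjacent pair seen at least `min_pair_freq` times, then parse greedily from left to right) is purely an assumption for illustration, as are the function names. Codewords are assigned with the usual End-Tagged Dense Code scheme, in which the last byte of each codeword has its high bit set and all earlier bytes do not.

```python
from collections import Counter


def etdc_codeword(rank):
    """End-Tagged Dense Code codeword for a 0-based frequency rank.

    The last byte lies in 128..255 (high bit set); earlier bytes lie in 0..127.
    """
    out = [128 + rank % 128]
    rank = rank // 128 - 1
    while rank >= 0:
        out.append(rank % 128)
        rank = rank // 128 - 1
    return bytes(reversed(out))


def parse_with_pairs(words, pairs):
    """Greedy left-to-right parse into single-word and word-pair symbols."""
    symbols = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            symbols.append((words[i], words[i + 1]))
            i += 2
        else:
            symbols.append((words[i],))
            i += 1
    return symbols


def compress(words, min_pair_freq=2):
    """Toy pair-based compressor; returns (compressed bytes, codebook)."""
    # Assumed selection rule, not taken from the paper: frequent adjacent pairs.
    pair_counts = Counter(zip(words, words[1:]))
    pairs = {p for p, c in pair_counts.items() if c >= min_pair_freq}
    symbols = parse_with_pairs(words, pairs)
    # More frequent symbols get lower ranks and therefore shorter codewords.
    freq = Counter(symbols)
    ranked = sorted(freq, key=lambda s: (-freq[s], s))
    code = {s: etdc_codeword(r) for r, s in enumerate(ranked)}
    return b"".join(code[s] for s in symbols), code


words = "to be or not to be that is the question to be".split()
with_pairs, _ = compress(words)
words_only, _ = compress(words, min_pair_freq=len(words) + 1)
print(len(with_pairs), len(words_only))  # prints "9 12"
```

In this toy input the pair ("to", "be") occurs three times, so each occurrence is written as one byte instead of two. A real implementation must also weigh the cost of the larger vocabulary, which this sketch ignores.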
