New adaptive compressors for natural language text

Semistatic byte-oriented word-based compression codes have been shown to be an attractive way to compress natural language text databases, because of the combination of speed, effectiveness, and direct searchability they offer. In particular, our recently proposed family of dense compression codes has been shown to be superior to the more traditional byte-oriented word-based Huffman codes in most aspects. In this paper, we focus on the problem of transmitting texts among peers that do not share the vocabulary, which is the typical scenario for adaptive compression methods. We design adaptive variants of our semistatic dense codes and show that they are much simpler and faster than dynamic Huffman codes while reaching almost the same compression effectiveness. We also show that our variants offer a very compelling trade-off between compression and decompression speed, compression ratio, and search speed compared with most state-of-the-art general-purpose compressors. Copyright © 2008 John Wiley & Sons, Ltd. A preliminary partial version of this work appeared in [1].
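The abstract does not spell out how a dense code or its adaptive variant works, so the following is only a minimal sketch under stated assumptions. It uses the byte layout commonly described for End-Tagged Dense Code, one member of the dense code family: the last byte of every codeword has its high bit set and all earlier bytes do not. The adaptive part is a simplified illustration, not the authors' algorithm: both peers keep the same frequency-ordered vocabulary, and an unseen word is sent as an escape codeword (the first unused rank) followed by the word itself. All names (`etdc_encode`, `AdaptiveModel`, `compress`, `decompress`) and the NUL terminator for new words are hypothetical choices made for this example.

```python
from typing import Dict, List


def etdc_encode(rank: int) -> bytes:
    """Codeword for the word with 0-based frequency rank `rank`."""
    out = [128 + rank % 128]          # last byte: high bit set marks the end
    rank //= 128
    while rank > 0:
        rank -= 1
        out.append(rank % 128)        # earlier bytes stay below 128
        rank //= 128
    return bytes(reversed(out))


def etdc_decode(code: bytes) -> int:
    """Inverse of etdc_encode for a single complete codeword."""
    rank = 0
    for b in code:
        if b >= 128:
            return rank * 128 + (b - 128)
        rank = rank * 128 + b + 1
    raise ValueError("codeword has no end-tagged byte")


class AdaptiveModel:
    """Vocabulary kept in non-increasing frequency order (assumed scheme)."""

    def __init__(self) -> None:
        self.words: List[str] = []
        self.counts: List[int] = []
        self.rank: Dict[str, int] = {}

    def add(self, word: str) -> None:
        self.rank[word] = len(self.words)
        self.words.append(word)
        self.counts.append(1)

    def bump(self, pos: int) -> None:
        # Increase the count, then move the word ahead of less frequent ones.
        self.counts[pos] += 1
        while pos > 0 and self.counts[pos - 1] < self.counts[pos]:
            a, b = pos - 1, pos
            self.words[a], self.words[b] = self.words[b], self.words[a]
            self.counts[a], self.counts[b] = self.counts[b], self.counts[a]
            self.rank[self.words[a]] = a
            self.rank[self.words[b]] = b
            pos -= 1


def compress(tokens: List[str]) -> bytes:
    model, out = AdaptiveModel(), bytearray()
    for w in tokens:
        pos = model.rank.get(w)
        if pos is None:
            out += etdc_encode(len(model.words))   # escape: first unused rank
            out += w.encode("utf-8") + b"\0"       # assumes no NUL in words
            model.add(w)
        else:
            out += etdc_encode(pos)
            model.bump(pos)
    return bytes(out)


def decompress(data: bytes) -> List[str]:
    model, tokens, i = AdaptiveModel(), [], 0
    while i < len(data):
        j = i
        while data[j] < 128:                       # scan to the end-tagged byte
            j += 1
        pos = etdc_decode(data[i:j + 1])
        i = j + 1
        if pos == len(model.words):                # escape: new word follows
            end = data.index(0, i)
            w = data[i:end].decode("utf-8")
            i = end + 1
            model.add(w)
        else:
            w = model.words[pos]
            model.bump(pos)
        tokens.append(w)
    return tokens
```

Because the receiver applies exactly the same vocabulary updates as the sender, no vocabulary needs to be transmitted in advance. A linear bubble-up is used here only for brevity; a practical implementation would track frequency groups to update ranks in constant time.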

[1] Maxime Crochemore, et al. Factor Oracle: A New Structure for Pattern Matching, 1999, SOFSEM.

[2] Gonzalo Navarro, et al. Building extensible routers using network processors: Research Articles, 2005.

[3] Alistair Moffat, et al. Compression and Coding Algorithms, 2005, IEEE Trans. Inf. Theory.

[4] Gonzalo Navarro, et al. Simple, Fast, and Efficient Natural Language Adaptive Compression, 2004, SPIRE.

[5] Alistair Moffat, et al. Word-based text compression, 1989, Softw. Pract. Exp.

[6] H. S. Heaps, et al. Information retrieval, computational and theoretical aspects, 1978.

[7] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[8] Robert S. Boyer, et al. A fast string searching algorithm, 1977, CACM.

[9] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[10] Gonzalo Navarro, et al. LZgrep: a Boyer–Moore string matching tool for Ziv–Lempel compressed text, 2005, Softw. Pract. Exp.

[11] J. Shane Culpepper, et al. Enhanced Byte Codes with Restricted Prefix Properties, 2005, SPIRE.

[12] Alistair Moffat, et al. Fast file search using text compression, 1997.

[13] Gonzalo Navarro, et al. Lightweight natural language text compression, 2006, Information Retrieval.

[14] D. J. Wheeler, et al. A Block-sorting Lossless Data Compression Algorithm, 1994.

[15] R. Nigel Horspool, et al. Practical fast searching in strings, 1980, Softw. Pract. Exp.

[16] Hugh E. Williams, et al. Compressing Integers for Fast File Access, 1999, Comput. J.

[17] Donald E. Knuth, et al. Dynamic Huffman Coding, 1985, J. Algorithms.

[18] Nieves R. Brisaboa, et al. New Compression Codes for Text Databases, 2005.

[19] Robert E. Tarjan, et al. A Locally Adaptive Data, 1986.

[20] S. Golomb. Run-length encodings, 1966.

[21] Ricardo A. Baeza-Yates, et al. Fast and flexible word searching on compressed text, 2000, TOIS.

[22] Udi Manber, et al. Fast text searching: allowing errors, 1992, CACM.

[23] Gonzalo Navarro, et al. Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences, 2002.

[24] Gonzalo Navarro, et al. Efficiently decodable and searchable natural language adaptive compression, 2005, SIGIR '05.

[25] Solomon W. Golomb, et al. Run-length encodings (Corresp.), 1966, IEEE Trans. Inf. Theory.