(S, C)-Dense Coding: An Optimized Compression Code for Natural Language Text Databases

This work presents (s,c)-Dense Code, a new method for compressing natural language texts. The technique generalizes an earlier method, End-Tagged Dense Code, which achieves a better compression ratio and a simpler, faster encoding than Tagged Huffman. (s,c)-Dense Code is also a prefix code, so it keeps the most useful features of Tagged Huffman Code for direct search on the compressed text. It retains all the efficiency and simplicity of Tagged Huffman while improving its compression ratios.
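The abstract does not spell out the encoding, so the following is only a minimal sketch under stated assumptions: byte-oriented codewords, `s` "stopper" byte values that end a codeword, `c` "continuer" byte values that may precede the stopper, `s + c = 256`, and words identified by their frequency rank (0 is the most frequent). The function names `encode_rank` and `decode_rank`, and the choice of which byte values act as stoppers, are illustrative and not taken from the text.

```python
def encode_rank(rank: int, s: int, c: int) -> bytes:
    """Encode a word's frequency rank as zero or more continuers plus one stopper.

    Assumption of this sketch: byte values 0..s-1 are stoppers and
    s..s+c-1 are continuers, with s + c == 256.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    if s < 1 or c < 1 or s + c != 256:
        raise ValueError("need s >= 1, c >= 1 and s + c == 256")
    out = [rank % s]          # final byte: the stopper
    x = rank // s
    while x > 0:              # remaining bytes: continuers, built right to left
        x -= 1
        out.append(s + x % c)
        x //= c
    out.reverse()             # continuers first, stopper last
    return bytes(out)


def decode_rank(code: bytes, s: int, c: int) -> int:
    """Invert encode_rank; reads bytes up to and including the first stopper."""
    rank = 0
    for b in code:
        if b >= s:            # continuer: accumulate and keep reading
            rank = rank * c + (b - s) + 1
        else:                 # stopper: codeword ends here
            return rank * s + b
    raise ValueError("codeword has no stopper byte")


if __name__ == "__main__":
    # s = c = 128 is assumed here to correspond to End-Tagged Dense Code.
    for r in (0, 127, 128, 20000):
        cw = encode_rank(r, 128, 128)
        print(r, list(cw), decode_rank(cw, 128, 128))
```

Because every codeword ends at its first stopper byte, no codeword is a prefix of another, which is the property the abstract relies on for direct search. Under these assumptions, tuning `s` trades the number of one-byte codewords (`s`) against the number of two-byte codewords (`s * c`).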
