An Efficient Compression Code for Text Databases

We present a new compression format for natural language texts that allows both exact and approximate search without decompression. This new code, called End-Tagged Dense Code, has several advantages over other compression techniques with similar features, such as the Tagged Huffman Code of [Moura et al., ACM TOIS 2000]. Our compression method obtains (i) better compression ratios, (ii) a simpler vocabulary representation, and (iii) simpler and faster encoding. At the same time, it retains the most interesting features of the method based on the Tagged Huffman Code: exact search for words and phrases directly on the compressed text using any known sequential pattern matching algorithm, efficient word-based approximate and extended searches without any decoding, and efficient decompression of arbitrary portions of the text. As a side effect, our analytical results give new upper and lower bounds for the redundancy of d-ary Huffman codes.
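
The abstract does not spell out the construction, so the following is only a minimal sketch of how a word-based, byte-oriented end-tagged dense code could work. It assumes the scheme as it is commonly described: words are ranked by decreasing frequency, each codeword is a sequence of bytes, and the high bit is set only on the last byte of a codeword. The function names (`etdc_codeword`, `build_code`, `compress`, `find_word`) and the tie-breaking rule are hypothetical choices for illustration, not taken from the paper.

```python
from collections import Counter


def etdc_codeword(rank: int) -> bytes:
    """Codeword for the word with 0-based frequency rank `rank` (assumed scheme)."""
    out = [128 + rank % 128]      # last byte: tag bit (high bit) set
    x = rank // 128 - 1
    while x >= 0:
        out.append(x % 128)       # earlier bytes: tag bit clear
        x = x // 128 - 1
    return bytes(reversed(out))


def build_code(words):
    """Map each distinct word to a codeword; more frequent words get lower ranks."""
    freq = Counter(words)
    # Ties are broken alphabetically; this is an arbitrary illustrative choice.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    return {w: etdc_codeword(i) for i, w in enumerate(ranked)}


def compress(words, code):
    """Concatenate the codewords of a word sequence."""
    return b"".join(code[w] for w in words)


def find_word(data: bytes, codeword: bytes):
    """Search the compressed bytes directly with an ordinary substring search.

    A match is accepted only if it starts at a codeword boundary, that is,
    at position 0 or right after a byte whose tag bit is set.
    """
    hits = []
    pos = data.find(codeword)
    while pos != -1:
        if pos == 0 or data[pos - 1] >= 128:
            hits.append(pos)
        pos = data.find(codeword, pos + 1)
    return hits


words = "to be or not to be".split()
code = build_code(words)
data = compress(words, code)
positions = find_word(data, code["be"])
```

Under these assumptions the codeword depends only on the rank of the word, not on the exact frequencies, which is one way to read the abstract's claim of a simpler vocabulary representation and simpler encoding. The boundary check in `find_word` illustrates why a tag on the final byte lets a standard pattern matcher run over the compressed text without decoding it.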

[1] Ian H. Witten, et al. Managing gigabytes, 1994.

[2] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[3] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes, 1952.

[4] Alistair Moffat, et al. On the implementation of minimum-redundancy prefix codes, 1996, Proceedings of Data Compression Conference - DCC '96.

[5] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[6] Gonzalo Navarro, et al. Boyer-Moore String Matching over Ziv-Lempel Compressed Text, 2000, CPM.

[7] Dietrich Manstetten. Tight bounds on the redundancy of Huffman codes, 1992, IEEE Trans. Inf. Theory.

[8] Alfredo De Santis, et al. On Lower Bounds for the Redundancy of Optimal Codes, 1998, Des. Codes Cryptogr.

[9] Ricardo A. Baeza-Yates, et al. Adding Compression to Block Addressing Inverted Indexes, 2000, Information Retrieval.

[10] Alistair Moffat, et al. Word-based text compression, 1989, Softw. Pract. Exp.

[11] George Kingsley Zipf, et al. Human behavior and the principle of least effort, 1949.

[12] Ian H. Witten, et al. Text Compression, 1990, 125 Problems in Text Algorithms.

[13] Ricardo A. Baeza-Yates, et al. Fast and flexible word searching on compressed text, 2000, TOIS.

[14] Gonzalo Navarro, et al. Flexible Pattern Matching in Strings: Practical On-Line Search Algorithms for Texts and Biological Sequences, 2002.

[15] Ricardo A. Baeza-Yates, et al. Compression: A Key for Next-Generation Text Retrieval Systems, 2000, Computer.