Paper information - Dictionary-based order-preserving string compression

Dictionary-based order-preserving string compression

Abstract. As no database exists without indexes, no index implementation exists without order-preserving key compression, in particular without prefix and tail compression. However, despite the great potential for making indexes smaller and faster, the application of general compression methods to ordered data sets has advanced very little. This paper demonstrates that fast dictionary-based methods can be applied to order-preserving compression with almost the same freedom as in the general case. The proposed new technology has the same speed as traditional order-indifferent dictionary encoding and a compression rate only marginally lower. Procedures for encoding and for generating the encode tables are described, covering such order-related features as ordered data set restrictions, sensitivity and insensitivity to a character position, and one-symbol encoding of each frequent trailing character sequence. The experimental results presented demonstrate fivefold compression on real-life data sets and twelvefold compression on Wisconsin benchmark text fields.
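To make the order-preserving property concrete, here is a minimal Python sketch of one way a dictionary code can keep sort order. It is an illustration under stated assumptions, not the procedure from the paper: the abstract does not give the encoding details, so the interval construction, the function names (`build_intervals`, `encode`, `decode`), the lowercase-only alphabet, and the use of fixed-width integer codes are all assumptions made for this example. The idea is to split the space of strings into consecutive sorted intervals whose members all share one dictionary entry as a prefix, and to use the interval's rank as the code.

```python
from bisect import bisect_right
import string

ALPHABET = string.ascii_lowercase  # assumption: inputs are non-empty lowercase strings


def build_intervals(tokens):
    """Split the string space into sorted intervals that each share one prefix."""
    # Keep every single character as a symbol so any input stays encodable.
    symbols = set(ALPHABET) | set(tokens)
    bounds = set()
    for p in symbols:
        bounds.add(p)                             # smallest string with prefix p
        bounds.add(p[:-1] + chr(ord(p[-1]) + 1))  # smallest string past prefix p
    bounds = sorted(b for b in bounds if b[0] in ALPHABET)
    # The symbol consumed inside an interval is the longest dictionary entry
    # that is a prefix of the interval's lower bound.
    syms = [max((p for p in symbols if b.startswith(p)), key=len) for b in bounds]
    return bounds, syms


def encode(s, bounds, syms):
    """Return a list of interval ranks; list order mirrors string order."""
    codes = []
    while s:
        i = bisect_right(bounds, s) - 1  # last interval whose lower bound <= s
        if i < 0 or not s.startswith(syms[i]):
            raise ValueError("character outside the assumed alphabet")
        codes.append(i)
        s = s[len(syms[i]):]             # drop the prefix that code i stands for
    return codes


def decode(codes, syms):
    """Concatenate the symbol attached to each code."""
    return "".join(syms[i] for i in codes)


# Hypothetical dictionary of frequent substrings, chosen only for the demo.
bounds, syms = build_intervals(["th", "the", "ing", "er"])
words = ["the", "then", "thing", "tiger", "there", "tin", "ether", "a", "zing"]
encoded = {w: encode(w, bounds, syms) for w in words}

# Sorting by code sequence gives the same order as sorting the plain strings.
print(sorted(words) == sorted(words, key=lambda w: encoded[w]))
```

A plain greedy dictionary parse does not have this property: with entries "a" and "ab" ranked alphabetically, "ab" would receive a larger first code than "ac" even though it sorts earlier. Attaching codes to intervals rather than to dictionary entries avoids that, at the cost of a somewhat larger code table. A practical scheme would also replace the fixed-width ranks with variable-length codes.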

Gennady Antoshenkov
