Paper information - A Data Compression Scheme for Chinese Text Files Using Huffman Coding and a Two-Level Dictionary

Abstract: This paper presents a data compression scheme for Chinese text files. Because the distribution of Chinese ideograms is highly skewed, the scheme adopts Huffman coding. Three measures significantly improve the compression results: storing the frequencies of the encoding symbols in the dictionary rather than their Huffman codes, applying differential coding where it saves space, and organizing the Huffman dictionary as a two-level structure. The proposed method is evaluated by comparing its performance with that of three well-known compression algorithms. The algorithm should also be applicable to other ideogram-based or oriental-language texts, and it has the potential to reduce the dictionary size in a bigram- or trigram-based semi-adaptive compression scheme for English texts.
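The idea of storing symbol frequencies rather than Huffman codes can be illustrated with a minimal sketch. It is not the paper's implementation: the function names (`build_codes`, `encode`, `decode`), the tie-breaking rule, and the sample string are all assumptions made for illustration, and the two-level dictionary and differential coding are not shown. The sketch only demonstrates that a decoder holding the same frequency table can rebuild the same code table, provided both sides break ties identically.

```python
import heapq
from collections import Counter


def build_codes(freqs):
    """Build a Huffman code table from a {symbol: frequency} mapping.

    Ties are broken deterministically (assumed rule: by sorted symbol
    order, then by creation order of merged nodes), so an encoder and a
    decoder starting from the same frequencies derive the same codes.
    """
    # Heap entries are (weight, tie_breaker, tree); a tree is either a
    # symbol (str) or a (left, right) tuple. Tie breakers are unique, so
    # trees themselves are never compared.
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    if len(heap) == 1:
        # Degenerate case: a single symbol still needs a one-bit code.
        return {heap[0][2]: "0"}
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))
        next_id += 1
    # Walk the tree iteratively, assigning "0" to left and "1" to right.
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


def encode(text, codes):
    """Concatenate the code of each character into a bit string."""
    return "".join(codes[ch] for ch in text)


def decode(bits, freqs):
    """Rebuild the code table from frequencies alone, then decode."""
    inverse = {code: sym for sym, code in build_codes(freqs).items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:  # prefix-free codes: first match is correct
            out.append(inverse[current])
            current = ""
    return "".join(out)


# Hypothetical sample: a short string with a skewed character distribution.
text = "的一是的了的一在的是"
freqs = dict(Counter(text))
codes = build_codes(freqs)
bits = encode(text, codes)
assert decode(bits, freqs) == text
```

In this sketch the decoder receives only `freqs` and the bit string; the more frequent a character is, the shorter its code, which is why a skewed ideogram distribution favors Huffman coding.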

Ghim Hwee Ong | Shell-Ying Huang

[1] Ian H. Witten, et al. Modeling for text compression, 1989, CSUR.

[2] Abraham Lempel, et al. Compression of individual sequences via variable-rate coding, 1978, IEEE Trans. Inf. Theory.

[3] Abraham Lempel, et al. A universal algorithm for sequential data compression, 1977, IEEE Trans. Inf. Theory.

[4] Hassan K. Reghbati, et al. An Overview of Data Compression Techniques, 1981, Computer.

[5] Robert G. Gallager, et al. Variations on a theme by Huffman, 1978, IEEE Trans. Inf. Theory.

[6] H. E. White. Printed English compression by dictionary encoding, 1967.

[7] D. Huffman. A Method for the Construction of Minimum-Redundancy Codes, 1952.

[8] Jeffrey Scott Vitter, et al. Design and analysis of dynamic Huffman codes, 1987, JACM.

[9] Mostafa A. Bassiouni, et al. Data Compression in Scientific and Statistical Databases, 1985, IEEE Transactions on Software Engineering.

[10] David Cooper, et al. Text compression using variable- to fixed-length encodings, 1982, J. Am. Soc. Inf. Sci.

[11] Terry A. Welch, et al. A Technique for High-Performance Data Compression, 1984, Computer.

[12] H. S. Heaps. Data Compression of Large Document Data Bases, 1975, J. Chem. Inf. Comput. Sci.

[13] Daniel S. Hirschberg, et al. Data compression, 1987, CSUR.