Paper Information - Word-Based Text Compression

Word-Based Text Compression

Many universal compression algorithms exist today, but in most cases a specialized algorithm works better for a specific kind of data: JPEG for images, MPEG for video, and so on. For textual documents there are special methods based on the PPM algorithm, as well as methods that do not operate on individual characters, e.g. word-based compression. Several earlier papers described variants of word-based compression using Huffman encoding or the LZW method. This paper describes a word-based compression variant based on the LZ77 algorithm. It covers the LZ77 algorithm and its modifications, several ways of implementing the sliding window, and several options for encoding the output. It also presents an experimental implementation, tests its efficiency, and searches for the combination of LZ77 coder components that achieves the best compression ratio. Finally, the implemented application is compared with other word-based compression programs and with other commonly used compression programs.
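To make the idea of word-based LZ77 concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple tokenization into alternating runs of word and non-word characters, a brute-force search over a sliding window of previous tokens, and the classic (offset, length, next token) triple as output. The function names, the window size, and the triple format are all illustrative assumptions; the abstract does not specify them.

```python
import re

def tokenize(text):
    # Assumed tokenization: maximal runs of word characters and
    # maximal runs of everything else, so joining the tokens
    # reproduces the input exactly.
    return re.findall(r"\w+|\W+", text)

def lz77_encode(tokens, window=4096):
    # Emits (offset, length, next_token) triples over tokens instead
    # of characters. Brute-force window search, for clarity only.
    out = []
    i, n = 0, len(tokens)
    while i < n:
        best_len, best_off = 0, 0
        for j in range(max(0, i - window), i):
            k = 0
            # Stop one token early so a literal always follows the match.
            while i + k < n - 1 and tokens[j + k] == tokens[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        out.append((best_off, best_len, tokens[i + best_len]))
        i += best_len + 1
    return out

def lz77_decode(triples):
    tokens = []
    for off, length, nxt in triples:
        start = len(tokens) - off
        # Copy token by token so overlapping matches work.
        for k in range(length):
            tokens.append(tokens[start + k])
        tokens.append(nxt)
    return tokens

text = "to be or not to be, that is the question: to be or not to be"
tokens = tokenize(text)
triples = lz77_encode(tokens)
restored = "".join(lz77_decode(triples))
```

A real coder would replace the quadratic search with an indexed sliding window and would entropy-code the triples; those are exactly the design choices the paper reports evaluating.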

Jan Platos | Jiri Dvorský
