Paper information - Study of Japanese text compression

Study of Japanese text compression

Summary form only given. The Japanese language has several thousand distinct characters, and each character code is 16 bits long. In such documents, the 16-bit units are correlated with one another. Conventional text compression uses 8-bit sampling because the text being compressed is usually English. We investigated compression schemes based on 16-bit sampling, expecting them to improve compression performance. In Japanese text, where words are short, statistical schemes with a PPM model provide better compression ratios than sliding-dictionary schemes, so we studied 16-bit sampling built on a statistical scheme with a PPM model. We show that the 16-bit sampling scheme provides good compression ratios for short documents smaller than several tens of kilobytes, such as office reports. The processing speed is also better.
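As a rough illustration of what 8-bit versus 16-bit sampling means, the sketch below cuts the same encoded Japanese string into 1-byte and 2-byte symbols and computes an order-0 (context-free) information estimate for each stream. It is not the authors' method and it is not PPM: the helper names `split_units` and `order0_bits` are hypothetical, and UTF-16 big-endian is used only as a convenient stand-in for a fixed 16-bit character code (the abstract does not name the encoding).

```python
import math
from collections import Counter


def split_units(data: bytes, unit_bytes: int) -> list[bytes]:
    """Cut a byte string into fixed-width symbols (1 = 8-bit, 2 = 16-bit sampling)."""
    if len(data) % unit_bytes:
        raise ValueError("data length must be a multiple of the unit size")
    return [data[i:i + unit_bytes] for i in range(0, len(data), unit_bytes)]


def order0_bits(symbols: list[bytes]) -> float:
    """Order-0 information content of a symbol stream, in bits (no context modeling)."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


# Assumption: every character below is in the Basic Multilingual Plane,
# so UTF-16-BE stores each one in exactly 2 bytes.
text = "日本語の文書を圧縮する。日本語の文字は十六ビットである。"
data = text.encode("utf-16-be")

units8 = split_units(data, 1)    # 8-bit sampling: two symbols per character
units16 = split_units(data, 2)   # 16-bit sampling: one symbol per character

print(len(units8), "8-bit symbols,", round(order0_bits(units8)), "bits (order-0)")
print(len(units16), "16-bit symbols,", round(order0_bits(units16)), "bits (order-0)")
```

With 16-bit sampling each symbol is a whole character, so a context model sees character-to-character dependencies directly; with 8-bit sampling the same dependencies are spread across pairs of bytes. A real comparison would feed both symbol streams to a PPM coder rather than use this order-0 estimate.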

Shigeru Yoshida | T. Morihara | N. Satoh | Y. Okada

[1] Ian H. Witten et al. Modeling for text compression, 1989, CSUR.

[2] Chi-Hung Chi et al. Extending Huffman coding for multilingual text compression, 1995, Proceedings DCC '95 Data Compression Conference.