Study of Japanese text compression

Summary form only given. Japanese text uses several thousand distinct characters, and each character code is 16 bits long. In such documents, successive 16-bit units are correlated with one another. Conventional text compression samples its input in 8-bit units because the data being compressed is usually English text. We investigated compression schemes based on 16-bit sampling, expecting them to improve compression performance. In Japanese text, where words are short, statistical schemes using a PPM model provide better compression ratios than sliding-dictionary schemes, so we studied 16-bit sampling in statistical schemes with a PPM model. We show that the 16-bit sampling scheme provides good compression ratios for short documents of less than several tens of kilobytes, such as office reports. Processing speed also improves.
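
The difference between 8-bit and 16-bit sampling can be illustrated with a minimal sketch. It is not the PPM scheme studied here: it only compares the ideal order-0 (context-free) code length of the same byte stream when it is split into 1-byte or 2-byte symbols. The sample sentence, the choice of UTF-16-BE as the 16-bit encoding, and the helper names `split_symbols` and `order0_bits` are all assumptions made for illustration, and the estimate ignores the cost of transmitting the model, which grows with the larger 16-bit alphabet.

```python
import math
from collections import Counter


def split_symbols(data, width):
    """Split a byte string into fixed-width symbols (1 = 8-bit, 2 = 16-bit)."""
    if len(data) % width != 0:
        raise ValueError("data length must be a multiple of the symbol width")
    return [data[i:i + width] for i in range(0, len(data), width)]


def order0_bits(symbols):
    """Ideal order-0 code length in bits, from empirical symbol frequencies.

    This is the empirical entropy times the number of symbols. It is a
    lower bound for a static order-0 coder and excludes any model cost.
    """
    counts = Counter(symbols)
    total = len(symbols)
    return -sum(c * math.log2(c / total) for c in counts.values())


# Hypothetical sample text; every character here occupies exactly one
# 16-bit code unit in UTF-16-BE (an assumed stand-in for a 16-bit
# Japanese character code).
text = "圧縮の研究。日本語の文書を圧縮する。" * 4
data = text.encode("utf-16-be")

bits8 = order0_bits(split_symbols(data, 1))   # bytes as symbols
bits16 = order0_bits(split_symbols(data, 2))  # 16-bit codes as symbols

print(f"8-bit sampling:  {bits8:.1f} bits")
print(f"16-bit sampling: {bits16:.1f} bits")
```

Under this order-0 estimate the 16-bit figure can never exceed the 8-bit one, because splitting a character into two bytes discards the dependence between its halves. Whether that gain survives in a real coder depends on how quickly the model learns the much larger alphabet, which is the trade-off the abstract reports on for short documents.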
