Paper information - A modified UTF-8 transformation format of ISO 10646 for storage optimization

Abstract: The ISO 10646 Universal Character Set (UCS) covers symbols in most of the world's written languages. Of the various UCS transformation formats (UTF), UTF-8 is the most important because it is compatible with both software systems and communication systems that assume 8-bit characters. First, three properties that a UTF-8-like transformation format should satisfy are defined so as to preserve the main characteristics of UTF-8. Then, a derived 5-byte sequence with 31 free bits is used to construct a UTF-8-like transformation format that can resolve dummy byte sequences locally. Next, we show that if the last-byte patterns of the 3-byte and 4-byte sequences in the UTF-8-like transformation format are replaced with the byte pattern 1xxxxxxx, two more free bits are gained for the 3-byte and 4-byte sequences. The final version of the derived UTF-8-like transformation format, UTF-8M, is proved to have the minimal average storage for encoding a UCS-4 character, 16.3% less than UTF-8 requires.
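For orientation, the baseline the abstract compares against can be sketched as follows. This is a minimal sketch of standard UTF-8 as originally specified in RFC 2279 (1 to 6 bytes for a 31-bit UCS-4 code point), not of UTF-8M, whose exact byte layout the abstract does not give. The function name `utf8_rfc2279_encode` is a hypothetical name chosen for illustration. The `10xxxxxx` trailing bytes it produces are the patterns that the abstract proposes to relax to `1xxxxxxx` in the last byte of 3-byte and 4-byte sequences.

```python
def utf8_rfc2279_encode(code_point: int) -> bytes:
    """Encode one UCS-4 code point with the original RFC 2279 UTF-8 scheme."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("UCS-4 code points are limited to 31 bits")
    if code_point < 0x80:
        # 1-byte sequence: 0xxxxxxx (7 free bits)
        return bytes([code_point])

    # (exclusive upper limit, sequence length, lead-byte marker)
    ranges = [
        (0x800, 2, 0xC0),       # 110xxxxx + 1 trailing byte  (11 free bits)
        (0x10000, 3, 0xE0),     # 1110xxxx + 2 trailing bytes (16 free bits)
        (0x200000, 4, 0xF0),    # 11110xxx + 3 trailing bytes (21 free bits)
        (0x4000000, 5, 0xF8),   # 111110xx + 4 trailing bytes (26 free bits)
        (0x80000000, 6, 0xFC),  # 1111110x + 5 trailing bytes (31 free bits)
    ]
    for limit, length, lead in ranges:
        if code_point < limit:
            break

    out = bytearray(length)
    # Fill trailing bytes from right to left; each is 10xxxxxx (6 payload bits).
    for i in range(length - 1, 0, -1):
        out[i] = 0x80 | (code_point & 0x3F)
        code_point >>= 6
    # The remaining high bits go into the lead byte.
    out[0] = lead | code_point
    return bytes(out)
```

For code points that modern UTF-8 also covers, the result matches Python's built-in codec; for example, `utf8_rfc2279_encode(0x4E2D)` and `"中".encode("utf-8")` both give the 3-byte sequence `E4 B8 AD`.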

Cheng-Huang Tung | Ming-Chi Lee
