Abstract: The ISO 10646 Universal Character Set (UCS) covers the symbols of most of the world's written languages. There are various UCS transformation formats (UTF), but UTF-8 is the most important one because of its compatibility with both software systems and communication systems that assume 8-bit characters. First, three properties that a UTF-8-like transformation format should satisfy are defined so as to preserve the main characteristics of UTF-8. Then, a derived 5-byte sequence with 31 free bits is presented and used to construct a UTF-8-like transformation format that is capable of resolving dummy byte sequences locally. After that, we show that if the last-byte patterns of the 3-byte and 4-byte sequences in the UTF-8-like transformation format are replaced with the byte pattern 1xxxxxxx, two more free bits are gained for the 3-byte and 4-byte sequences. The final version of the derived UTF-8-like transformation format, UTF-8M, is proved to require the minimal average storage for encoding a UCS-4 character, 16.3% less than UTF-8 requires.
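For orientation, the "free bits" counted in the abstract can be tallied for the baseline format. The following is a minimal sketch, assuming the original UTF-8 definition that allows sequences of up to six bytes (lead byte with n leading ones, trailing bytes of the form 10xxxxxx). The function names `utf8_free_bits` and `utf8_length` are hypothetical, and the sketch does not implement UTF-8M, whose byte patterns are not given here.

```python
def utf8_free_bits(n_bytes: int) -> int:
    """Payload ("free") bits in an n-byte sequence of original UTF-8.

    Assumption: the six-byte-maximum definition of UTF-8, not UTF-8M.
    """
    if not 1 <= n_bytes <= 6:
        raise ValueError("original UTF-8 uses 1 to 6 bytes")
    if n_bytes == 1:
        return 7  # pattern 0xxxxxxx
    # Lead byte: n ones, one zero, then (7 - n) payload bits.
    # Each of the (n - 1) trailing bytes 10xxxxxx carries 6 payload bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


def utf8_length(code_point: int) -> int:
    """Shortest original-UTF-8 sequence length for a UCS-4 code point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    for n in range(1, 7):
        if code_point < (1 << utf8_free_bits(n)):
            return n
    raise ValueError("code point exceeds 31 bits")


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, "byte(s):", utf8_free_bits(n), "free bits")
```

Under these assumptions, baseline UTF-8 needs six bytes to reach 31 free bits, which is the figure the abstract attributes to the derived 5-byte sequence.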