Paper Information - A Unified Model of Thai Romanization and Word Segmentation

Thai romanization is the writing of the Thai language using the Roman alphabet. It can be based on orthographic form (transliteration), on pronunciation (transcription), or on both; as a result, many romanization systems are in use. The Royal Institute has established a standard whose principles are based on transcription. To promote the standard, a fully automatic Thai romanization system should be made publicly available. In this paper, we discuss the problems of Thai romanization. We argue that automatic Thai romanization is difficult because ambiguities of pronunciation arise not only from ambiguities of syllable segmentation but also from ambiguities of word segmentation. A model of automatic romanization is designed and implemented on this basis, in which romanization and word segmentation are handled simultaneously. A syllable-segmented corpus and a word-pronunciation corpus are used to train the system. The accuracy of the system is 94.44% for unseen names and 99.58% for general texts. When the training corpus includes some proper names, the accuracy of romanizing unseen names increases from 94.44% to 97%. We attribute the system's performance to a design that fits the problem, treating word segmentation and romanization together.
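To make the idea of handling segmentation and romanization together concrete, the following is a minimal sketch and not the authors' model. It assumes a hypothetical toy lexicon in which each word carries one romanization and a made-up unigram probability, and it uses dynamic programming to pick the most probable segmentation; the romanization then follows from the chosen words. The function name `romanize`, the `LEXICON` table, and all probabilities are assumptions for illustration only.

```python
import math

# Hypothetical toy lexicon: word -> (romanization, made-up unigram probability).
# The probabilities are illustrative assumptions, not values from the paper.
LEXICON = {
    "ตา": ("ta", 0.04),
    "กลม": ("klom", 0.01),
    "ตาก": ("tak", 0.005),
    "ลม": ("lom", 0.02),
}


def romanize(text):
    """Return (words, romanization) for the most probable segmentation,
    or None if the text cannot be covered by lexicon words."""
    n = len(text)
    # best[i] = (log-probability, start index, word) of the best path ending at i
    best = [(-math.inf, None, None)] * (n + 1)
    best[0] = (0.0, None, None)
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if word in LEXICON and best[start][0] > -math.inf:
                score = best[start][0] + math.log(LEXICON[word][1])
                if score > best[end][0]:
                    best[end] = (score, start, word)
    if best[n][0] == -math.inf:
        return None
    # Walk the back-pointers to recover the chosen words.
    words = []
    pos = n
    while pos > 0:
        _, start, word = best[pos]
        words.append(word)
        pos = start
    words.reverse()
    return words, " ".join(LEXICON[w][0] for w in words)


if __name__ == "__main__":
    # "ตากลม" can be read as ตา+กลม ("ta klom") or ตาก+ลม ("tak lom").
    print(romanize("ตากลม"))
```

With these invented probabilities the string ตากลม is segmented as ตา + กลม and romanized as "ta klom"; different weights would select ตาก + ลม ("tak lom") instead, which shows how a word-segmentation decision determines the romanized output.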

Wanchai Rivepiboon | Wirote Aroonmanakun

[1] Randall K. Barry, et al. ALA-LC Romanization Tables: Transliteration Schemes for Non-Roman Scripts, 1991.

[2] Walter Daelemans, et al. Data-Oriented Methods for Grapheme-to-Phoneme Conversion, 1993, EACL.

[3] Kevin Knight, et al. Machine Transliteration, 1997, CL.

[4] Stanley F. Chen, et al. An Empirical Study of Smoothing Techniques for Language Modeling, 1996, ACL.

[5] Alan W. Black, et al. Statistically trained orthographic to sound models for Thai, 2000, INTERSPEECH.

[6] Virach Sornlertlamvanich, et al. Thai grapheme-to-phoneme using probabilistic GLR parser, 2001, INTERSPEECH.

[7] So Sethaputra, et al. Thai-English dictionary, 2001.

[8] Wirote Aroonmanakun, et al. Collocation and Thai Word Segmentation, 2002.

[9] Yaser Al-Onaizan, et al. Machine Transliteration of Names in Arabic Texts, 2002, SEMITIC@ACL.

[10] Virach Sornlertlamvanich, et al. A Context-Sensitive Homograph Disambiguation in Thai Text-to-Speech Synthesis, 2003, NAACL.