Thai Sentence-Breaking for Large-Scale SMT

Thai text presents challenges for integration into large-scale multilanguage statistical machine translation (SMT) systems, largely stemming from its nominal lack of punctuation and inter-word spaces. For Thai sentence breaking, we describe a monolingual maximum entropy classifier with features that may be applicable to other languages such as Arabic, Khmer, and Lao. We apply this sentence breaker to our large-vocabulary, general-purpose, bidirectional Thai-English SMT system and achieve BLEU scores of around 0.20, meeting our threshold for releasing it as a free online service.
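To make the classifier idea concrete, the following is a minimal sketch of a binary maximum entropy model (equivalent to logistic regression) that decides whether a given space token marks a sentence break. Everything specific here is an assumption for illustration: the feature templates (previous and next token), the placeholder tokens "A" through "D", the toy training data, and the use of plain stochastic gradient ascent in place of whatever training procedure the system itself uses. It is not the feature set or implementation described in the abstract.

```python
import math

def features(tokens, i):
    """Hypothetical feature templates for the space at tokens[i].

    These are illustrative only, not the paper's features.
    """
    prev_tok = tokens[i - 1] if i > 0 else "<s>"
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
    return ["bias", "prev=" + prev_tok, "next=" + next_tok]

def prob_break(weights, feats):
    """P(break | context) under a binary maxent (logistic) model."""
    score = sum(weights.get(f, 0.0) for f in feats)
    return 1.0 / (1.0 + math.exp(-score))

def train(examples, epochs=200, lr=0.5):
    """Fit feature weights by stochastic gradient ascent on the log-likelihood."""
    weights = {}
    for _ in range(epochs):
        for feats, label in examples:
            # Gradient for each active binary feature is (label - prediction).
            err = label - prob_break(weights, feats)
            for f in feats:
                weights[f] = weights.get(f, 0.0) + lr * err
    return weights

# Toy data with placeholder tokens: the space between A and B is a sentence
# break (label 1); the space between C and D is an ordinary space (label 0).
break_ctx = ["A", " ", "B"]
plain_ctx = ["C", " ", "D"]
examples = [
    (features(break_ctx, 1), 1),
    (features(plain_ctx, 1), 0),
]
weights = train(examples)

p_break = prob_break(weights, features(break_ctx, 1))
p_plain = prob_break(weights, features(plain_ctx, 1))
is_break = p_break > 0.5
```

In a real setting the tokens would come from a Thai word segmenter, and each space token in the input would be classified in turn, with the text split wherever the predicted probability exceeds a chosen threshold.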