Detecting Romanized Thai tokens in social media texts

Social media content is created by a large number of users, each with their own writing style shaped by their creativity and attitudes. In online social networks used by Thai speakers, typed text sometimes includes Thai words transliterated into Roman letters. As a result, text-to-speech systems cannot pronounce these transliterated tokens correctly. In this work, we propose and evaluate statistical methods for detecting Romanized Thai tokens. Both context-dependent and context-free classification features are proposed. Real social network texts are used to construct the training and test sets. Human subjects detect Romanized Thai tokens with 91.16% accuracy on average when adjacent context is hidden, and with 99.41% accuracy when context is shown. With the proposed features, a decision tree-based classifier and an N-gram-based classifier yield 87.63% and 74.42% accuracy, respectively. In the latter case, accuracy increases to 82.60% when the tokens' presence in English dictionaries is taken into account. Combining the two methods results in a detection accuracy of 89.36%.
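To make the N-gram-based detection with an English dictionary check more concrete, the following is a minimal sketch and not the authors' implementation. It assumes character bigram models with add-one smoothing, one trained on Romanized Thai words and one on English words, and it labels a token as Romanized Thai when the Thai model scores it higher and the token is absent from an English word list. The toy word lists, the names `CharNgramModel` and `is_romanized_thai`, the bigram order, and the smoothing scheme are all illustrative assumptions.

```python
import math
from collections import Counter

ALPHABET_SIZE = 28  # assumption: 26 letters plus start (^) and end ($) markers


def char_ngrams(token, n=2):
    """Return the character n-grams of a token, padded with boundary markers."""
    padded = "^" + token.lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (illustrative only)."""

    def __init__(self, words, n=2):
        self.n = n
        self.counts = Counter(g for w in words for g in char_ngrams(w, n))
        self.total = sum(self.counts.values())
        # Shared smoothing vocabulary so scores from two models are comparable.
        self.vocab = ALPHABET_SIZE ** n

    def avg_log_prob(self, token):
        """Average smoothed log-probability per n-gram of the token."""
        grams = char_ngrams(token, self.n)
        log_prob = sum(
            math.log((self.counts[g] + 1) / (self.total + self.vocab))
            for g in grams
        )
        return log_prob / len(grams)


def is_romanized_thai(token, thai_model, eng_model, eng_dict):
    """Hypothetical decision rule: dictionary check first, then model scores."""
    if token.lower() in eng_dict:
        return False  # token exists in the English word list
    return thai_model.avg_log_prob(token) > eng_model.avg_log_prob(token)


# Toy training data (assumed for illustration; not from the paper's corpus).
thai_words = ["sawasdee", "khrap", "khob", "khun", "mak", "aroi", "sabai", "krub"]
english_words = ["hello", "thank", "you", "very", "much", "good", "the", "and"]

thai_model = CharNgramModel(thai_words)
eng_model = CharNgramModel(english_words)
eng_dict = set(english_words)

for tok in ["khrap", "hello", "the"]:
    print(tok, is_romanized_thai(tok, thai_model, eng_model, eng_dict))
```

In this sketch the dictionary check acts as a veto over the N-gram score, which is one plausible way to combine the two signals; the abstract does not specify how the combination is actually done.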
