Simplified Abugidas

An abugida is a writing system in which consonant letters represent syllables with a default vowel, and other vowels are denoted by diacritics. We investigate the feasibility of recovering the original text written in an abugida after omitting subordinate diacritics and merging consonant letters with similar phonetic values. Such recovery is crucial for developing more efficient input methods that reduce the complexity of abugidas. Four abugidas in the southern Brahmic family (Thai, Burmese, Khmer, and Lao) were studied using a 20,000-sentence newswire dataset. We compared the recovery performance of a support vector machine and an LSTM-based recurrent neural network, and found that the abugida graphemes could be recovered with 94% to 97% accuracy at the top-1 level and 98% to 99% at the top-4 level, even after omitting most diacritics (10 to 30 types) and merging the remaining 30 to 50 characters into 21 graphemes.
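The simplification step described above (omit diacritics, then merge similar consonants) can be illustrated with a minimal Python sketch. It rests on two assumptions that are not taken from the paper: "subordinate diacritics" are approximated by Unicode combining marks (categories `Mn` and `Mc`), and `MERGE` is a small hypothetical table for a few Thai consonants, not the 21-grapheme scheme actually used. The function name `simplify` is likewise illustrative.

```python
import unicodedata

# Hypothetical merge table (an assumption, NOT the paper's mapping):
# a few Thai consonants with similar phonetic values share one symbol.
MERGE = {
    "\u0E01": "K",  # KO KAI
    "\u0E02": "K",  # KHO KHAI
    "\u0E04": "K",  # KHO KHWAI
    "\u0E19": "N",  # NO NU
    "\u0E13": "N",  # NO NEN
}

def simplify(text, merge=MERGE):
    """Drop combining marks, then map each remaining character
    through the merge table (unmapped characters pass through)."""
    out = []
    for ch in text:
        # Approximation: treat every combining mark as an omitted diacritic.
        if unicodedata.category(ch) in ("Mn", "Mc"):
            continue
        out.append(merge.get(ch, ch))
    return "".join(out)

# Three distinct Thai words collapse to the same simplified string,
# which is why recovery has to rely on context.
kin = "\u0E01\u0E34\u0E19"   # KO KAI + SARA I + NO NU
khon = "\u0E02\u0E19"        # KHO KHAI + NO NU
khon2 = "\u0E04\u0E19"       # KHO KHWAI + NO NU
print(simplify(kin), simplify(khon), simplify(khon2))
```

Because several originals map to one simplified form, the recovery task is a disambiguation problem over the surrounding context, which is the role the support vector machine and the LSTM-based network play in the study.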
