Towards Burmese (Myanmar) Morphological Analysis

This article presents a comprehensive study of two primary tasks in Burmese (Myanmar) morphological analysis: tokenization and part-of-speech (POS) tagging. Twenty thousand Burmese newswire sentences are annotated with two-layer tokenization and POS-tagging information as one component of the Asian Language Treebank project. The annotated corpus has been released under a CC BY-NC-SA license and was the largest open-access database of annotated Burmese at the time this manuscript was prepared in 2017. The first half of the article describes in detail the preparation, refinement, and features of the annotated corpus. Facilitated by the corpus, the second half presents experiment-based investigations in which the standard sequence-labeling approach of conditional random fields and a long short-term memory (LSTM)-based recurrent neural network (RNN) are applied and discussed. We draw several general conclusions, covering the effect of joint tokenization and POS tagging and the importance of ensembling for stabilizing the performance of the LSTM-based RNN. This study provides a solid basis for further research on Burmese language processing.
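As a rough illustration of the sequence-labeling formulation summarized above (a minimal sketch, not the authors' code), the snippet below shows how joint tokenization and POS tagging can be recast as a single tagging problem over syllables, with the word-boundary position and the POS category fused into one label per syllable. The example sentence, tag names, and helper function are hypothetical.

```python
# Minimal sketch: joint tokenization and POS tagging as one sequence-labeling
# task over syllables. Each word is given as (list of syllables, POS tag);
# the toy sentence and tag set below are invented for illustration.

def to_joint_labels(words):
    """Convert (syllables, pos) word pairs into per-syllable B-/I- joint tags."""
    syllables, labels = [], []
    for sylls, pos in words:
        for i, syll in enumerate(sylls):
            syllables.append(syll)
            prefix = "B" if i == 0 else "I"   # word-boundary (tokenization) layer
            labels.append(f"{prefix}-{pos}")  # POS layer fused into the same tag
    return syllables, labels

# A toy "annotated" sentence of three words.
sentence = [(["sy1", "sy2"], "n"), (["sy3"], "v"), (["sy4", "sy5"], "ppm")]

sylls, tags = to_joint_labels(sentence)
for syll, tag in zip(sylls, tags):
    print(syll, tag)   # e.g. sy1 B-n / sy2 I-n / sy3 B-v / ...
```

A CRF or LSTM-based RNN tagger trained on such joint labels then recovers word boundaries and POS tags in a single decoding pass, which is the joint setup whose effect is examined in the experiments.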
