Word Segmentation for Burmese (Myanmar)

This note reports and discusses experiments on several word segmentation approaches for the Burmese language: dictionary-based, statistical, and machine learning approaches. The experimental results show that statistical and machine learning approaches perform significantly better than dictionary-based approaches. We believe that this note, based on a relatively large annotated corpus (approximately half a million words), is the first systematic comparison of word segmentation approaches for Burmese. The work aims to identify the properties of Burmese text and suitable approaches to processing it, and to promote further research on this understudied language.
