Word segmentation for the Myanmar language

This study reports the development of a word segmentation method for Myanmar text in standard Unicode encoding. Word segmentation is an essential first step in natural language processing for Myanmar, because Myanmar text is written as a string of characters without explicit word boundary delimiters. The proposed method has two phases: syllable segmentation and syllable merging. Syllable segmentation uses a rule-based heuristic approach, and syllable merging uses a dictionary-based statistical approach. The test results showed that the method is very effective for the Myanmar language.
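The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the actual rules or the statistical model, so the syllable rule below is one commonly used heuristic (a syllable starts at a consonant that is neither stacked nor a final), and the merging step is a stand-in that picks the split with the highest unigram score over a toy dictionary. The names `segment_syllables`, `merge_syllables`, `word_counts`, and `max_len`, as well as the example counts, are assumptions made for illustration.

```python
import math
import re

# Code points from the Unicode Myanmar block.
CONSONANT = "\u1000-\u1021"   # KA .. A
DOT_BELOW = "\u1037"
VIRAMA = "\u1039"             # stacks the following consonant
ASAT = "\u103A"               # marks a syllable-final consonant

# Assumed heuristic: a syllable starts at a consonant that is not stacked
# (not preceded by virama) and is not a final (not followed by asat or
# virama, with an optional dot below in between).
SYLLABLE_START = re.compile(
    rf"(?<!{VIRAMA})[{CONSONANT}](?!{DOT_BELOW}?[{ASAT}{VIRAMA}])"
)


def segment_syllables(text):
    """Phase 1: split text at every position matched by SYLLABLE_START."""
    starts = [m.start() for m in SYLLABLE_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)
    bounds = starts + [len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:]) if a < b]


def merge_syllables(syllables, word_counts, max_len=4):
    """Phase 2 (stand-in): choose the split with the best unigram log score.

    word_counts is a hypothetical dictionary mapping words to corpus counts.
    A single syllable missing from the dictionary gets a low floor score.
    """
    total = sum(word_counts.values()) or 1
    unknown = math.log(0.5 / total)
    n = len(syllables)
    best = [(0.0, 0)] + [(-math.inf, 0)] * n   # (score, start of last word)
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = "".join(syllables[start:end])
            if word in word_counts:
                score = math.log(word_counts[word] / total)
            elif end - start == 1:
                score = unknown
            else:
                continue
            candidate = best[start][0] + score
            if candidate > best[end][0]:
                best[end] = (candidate, start)
    # Follow the back-pointers from the end of the sentence.
    words, end = [], n
    while end > 0:
        start = best[end][1]
        words.append("".join(syllables[start:end]))
        end = start
    return words[::-1]


if __name__ == "__main__":
    text = "\u1019\u103C\u1014\u103A\u1019\u102C\u1005\u102C"
    toy_counts = {
        "\u1019\u103C\u1014\u103A\u1019\u102C": 5,
        "\u1005\u102C": 3,
        "\u1019\u103C\u1014\u103A": 1,
    }
    print(merge_syllables(segment_syllables(text), toy_counts))
```

With the toy counts above, the eight-character input is first cut into three syllables and then merged into two words, because the two-syllable dictionary entry scores higher than its parts taken separately. A real system would need a fuller rule set (digits, punctuation, independent vowels) and counts estimated from a corpus.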
