Integrating Dictionaries into an Unsupervised Model for Myanmar Word Segmentation

This paper addresses word segmentation for low-resource languages, focusing on the Myanmar language. Our method exploits a limited amount of dictionary resources to improve the segmentation quality of an unsupervised word segmenter. We propose three models. In the first, a set of dictionaries (separate dictionaries for different classes of words) is introduced directly into the generative model. In the second, a language model is built from the dictionaries, and this n-gram model is inserted into the generative model; it is intended to model words that do not occur in the training data. The third model combines the previous two. We evaluated our approach on a corpus of manually annotated data. The results show that the proposed methods improve over a fully unsupervised baseline system: the best of our systems raised the F-score from 0.48 to 0.66. In addition to segmenting the data, one of the proposed methods can also partially label the segmented corpus with POS tags; we found these labels to be approximately 66% accurate.
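The abstract reports segmentation quality as an F-score but does not spell out the evaluation protocol. As a rough illustration only, the sketch below shows one common way to score a segmentation: each word is treated as a character span, and precision and recall are computed over the spans that match the reference exactly. The function names and the Latin-letter placeholder strings are hypothetical and are not taken from the paper.

```python
def word_spans(words):
    """Map a list of words to the set of (start, end) character spans they occupy."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_f_score(reference, hypothesis):
    """Return (precision, recall, F-score) over exactly matching word spans.

    Assumption: both segmentations cover the same underlying character string.
    """
    ref_spans = word_spans(reference)
    hyp_spans = word_spans(hypothesis)
    correct = len(ref_spans & hyp_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(hyp_spans)
    recall = correct / len(ref_spans)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Placeholder example: the same five characters segmented two ways.
reference = ["ab", "c", "de"]   # hypothetical manual annotation
hypothesis = ["ab", "cde"]      # hypothetical segmenter output
precision, recall, f_score = segmentation_f_score(reference, hypothesis)
```

Here only the first word matches, so precision is computed over two hypothesized words and recall over three reference words. Scoring by span rather than by word string avoids crediting a word that appears in both segmentations but at different positions.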
