Paper information - Myanmar Word Segmentation using Syllable level Longest Matching

Myanmar Word Segmentation using Syllable level Longest Matching

In the Myanmar language, sentences are clearly delimited by a unique sentence boundary marker, but words are not necessarily separated by spaces. Segmenting sentences into words is therefore non-trivial, yet word tokenization plays a vital role in most Natural Language Processing applications. We observe that word boundaries generally align with syllable boundaries, whereas working directly with characters does not help. It is therefore useful to syllabify texts first, although syllabification is itself a non-trivial task in Myanmar. We have collected 4,550 syllables from available sources. We evaluated this syllable inventory on 2,728 sentences spread over 258 pages and observed a coverage of 99.96%. In the second part, we build word lists from available sources such as dictionaries, through the application of morphological rules, and by generating syllable n-grams as candidate words and checking them manually. We have thus built a list of 800,000 words, including inflected forms. We tested our algorithm on a 5,000-sentence test set containing a total of 35,049 words and manually checked the output to evaluate performance. The program recognized 34,943 words, of which 34,633 were correct, giving a recall of 98.81%, a precision of 99.11%, and an F-measure of 98.95%.
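The abstract does not spell out the matching procedure, so the following is only a minimal sketch of greedy left-to-right longest matching over syllables, under several assumptions: the input is already syllabified, the word list is stored as a set of syllable tuples, an unmatched syllable is emitted as a one-syllable word, and candidate words are capped at `max_len` syllables. The names `longest_match_segment`, `precision_recall_f`, `lexicon`, and `max_len`, as well as the placeholder syllables, are hypothetical and not taken from the paper. The second function recomputes the scores from the counts given in the abstract.

```python
from typing import List, Set, Tuple


def longest_match_segment(
    syllables: List[str],
    lexicon: Set[Tuple[str, ...]],
    max_len: int = 6,  # assumed cap on word length in syllables
) -> List[List[str]]:
    """Greedy left-to-right longest matching over a syllable sequence."""
    words: List[List[str]] = []
    i = 0
    while i < len(syllables):
        end = min(len(syllables), i + max_len)
        # Try the longest candidate first, then shrink one syllable at a time.
        for j in range(end, i, -1):
            candidate = tuple(syllables[i:j])
            # Assumed fallback: a single unmatched syllable becomes a word.
            if candidate in lexicon or j == i + 1:
                words.append(list(candidate))
                i = j
                break
    return words


def precision_recall_f(correct: int, recognized: int, reference: int):
    """Precision, recall, and their harmonic mean from raw word counts."""
    precision = correct / recognized
    recall = correct / reference
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Placeholder syllables stand in for real Myanmar syllables.
    toy_lexicon = {("a", "b"), ("a", "b", "c"), ("d",)}
    print(longest_match_segment(["a", "b", "c", "d", "e"], toy_lexicon))

    # Counts from the abstract: correct, recognized, and reference words.
    p, r, f = precision_recall_f(34633, 34943, 35049)
    print(f"precision={p:.4%} recall={r:.4%} f={f:.4%}")
```

With the counts from the abstract, the precision and recall round to the reported 99.11% and 98.81%, and the harmonic mean lands within a couple of hundredths of a percentage point of the reported F-measure. Greedy matching commits to the longest word at each position and never backtracks, so its accuracy depends heavily on the coverage of the word list.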

Kavi Narayana Murthy | Hla Hla Htay

[1] Orchestra, 1998.

[2] Peter Willett, et al. Automatic Spelling Correction Using a Trigram Similarity Measure, 1983, Inf. Process. Manag.

[3] Kalervo Järvelin, et al. Targeted s-gram matching: a novel n-gram matching technique for cross- and mono-lingual word form variants, 2002, Inf. Res.

[4] Youngja Park. Identification of Probable Real Words: An Entropy-based Approach, 2002, ACL 2002.