Khmer Word Segmentation Using Conditional Random Fields

Word segmentation is a critical task that underlies much natural language processing research. This paper studies Khmer word segmentation using an approach based on conditional random fields (CRFs). We developed a large manually segmented corpus to train the segmenter, and we describe the word segmentation strategies that the human annotators followed during manual annotation. We compared the trained CRF segmenter empirically with a baseline based on maximum matching, which used a dictionary extracted from the manually segmented corpus. The CRF segmenter outperformed the baseline by a wide margin in precision, recall, and F-score. We also evaluated the segmenter as a pre-processing step in a statistical machine translation system, where it increased BLEU scores by up to 7.7 points relative to a maximum matching baseline.
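To make the baseline and the evaluation metrics concrete, the following is a minimal sketch of greedy forward maximum matching and of word-level precision, recall, and F-score. It is an illustration under stated assumptions, not the paper's implementation: the function names (`max_match`, `spans`, `prf`), the single-character fallback for unmatched text, the span-based scoring, and the toy Latin-script dictionary are all hypothetical choices made for this example.

```python
def max_match(text, dictionary):
    """Greedy forward maximum matching: at each position, take the
    longest dictionary entry that matches; if none matches, emit a
    single character (an assumed fallback)."""
    max_len = max((len(w) for w in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary:
                break
        else:
            j = i + 1  # no dictionary entry starts here
        words.append(text[i:j])
        i = j
    return words


def spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def prf(predicted, reference):
    """Word-level precision, recall, and F-score; a predicted word is
    correct only if both of its boundaries match the reference."""
    p, r = spans(predicted), spans(reference)
    correct = len(p & r)
    precision = correct / len(p) if p else 0.0
    recall = correct / len(r) if r else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example (Latin script, hypothetical dictionary and reference).
dictionary = {"the", "them", "men", "eat"}
reference = ["the", "men", "eat"]
predicted = max_match("themeneat", dictionary)
precision, recall, f_score = prf(predicted, reference)
print(predicted, precision, recall, f_score)
```

In this toy case the greedy choice of "them" over "the" derails the following words, which is the kind of error a dictionary-driven baseline cannot recover from and a sequence model trained on segmented text can learn to avoid.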
