Paper Information - Khmer Word Segmentation Using Conditional Random Fields

Word segmentation is a critical task that underlies much natural language processing research. This paper studies Khmer word segmentation using an approach based on conditional random fields (CRFs). A large manually segmented corpus was developed to train the segmenter, and we describe the word segmentation strategies that the human annotators followed during manual annotation. The trained CRF segmenter was compared empirically to a baseline based on maximum matching that used a dictionary extracted from the manually segmented corpus. The CRF segmenter outperformed the baseline by a wide margin in precision, recall and F-score. The segmenter was also evaluated as a pre-processing step in a statistical machine translation system, where it increased BLEU scores by up to 7.7 points relative to the maximum matching baseline.
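To make the baseline and the evaluation metrics concrete, the following is a minimal sketch of dictionary-based maximum matching together with word-level precision, recall and F-score. It is illustrative only: the abstract does not say which maximum matching variant was used, so a greedy forward longest-match is assumed here, and the function names (`max_match`, `spans`, `prf`) and the toy Latin-script dictionary are hypothetical rather than taken from the paper.

```python
def max_match(text, dictionary):
    """Greedy forward maximum matching (assumed variant, not the paper's code).

    At each position, take the longest dictionary word that starts there;
    if none matches, emit a single character and move on.
    """
    max_len = max([len(w) for w in dictionary] + [1])
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; j == i + 1 is the one-character fallback.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words


def spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    result, pos = set(), 0
    for w in words:
        result.add((pos, pos + len(w)))
        pos += len(w)
    return result


def prf(predicted, reference):
    """Word-level precision, recall and F-score based on exactly matching spans."""
    p_spans, r_spans = spans(predicted), spans(reference)
    correct = len(p_spans & r_spans)
    precision = correct / len(p_spans) if p_spans else 0.0
    recall = correct / len(r_spans) if r_spans else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy example: greedy matching picks "them" and then cannot recover.
toy_dictionary = {"the", "them", "me", "men", "eat", "at"}
predicted = max_match("themeneat", toy_dictionary)  # ['them', 'e', 'n', 'eat']
reference = ["the", "men", "eat"]
precision, recall, f_score = prf(predicted, reference)
```

In this toy case only one of the four predicted words matches the reference, which shows the kind of greedy error a dictionary-only baseline can make and a sequence model trained on segmented text may avoid.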
