Optimizing Chinese Word Segmentation for Machine Translation Performance

Previous work has shown that Chinese word segmentation is useful for machine translation to English, yet the way different segmentation strategies affect MT is still poorly understood. In this paper, we demonstrate that optimizing segmentation for an existing segmentation standard does not always yield better MT performance. We find that other factors such as segmentation consistency and granularity of Chinese "words" can be more important for machine translation. Based on these findings, we implement methods inside a conditional random field segmenter that directly optimize segmentation granularity with respect to the MT task, providing an improvement of 0.73 BLEU. We also show that improving segmentation consistency using external lexicon and proper noun features yields a 0.32 BLEU increase.

[1]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[2]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[3]  Dale Schuurmans,et al.  Investigating the Relationship between Word Segmentation Performance and Retrieval Performance in Chinese IR , 2002, COLING.

[4]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[5]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[6]  Philipp Koehn,et al.  Empirical Methods for Compound Splitting , 2003, EACL.

[7]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[8]  Pat Langley,et al.  Editorial: On Machine Learning , 1986, Machine Learning.

[9]  Hwee Tou Ng,et al.  Chinese Part-of-Speech Tagging: One-at-a-Time or All-at-Once? Word-Based or Character-Based? , 2004, EMNLP.

[10]  Andrew McCallum,et al.  Chinese Segmentation and New Word Detection using Conditional Random Fields , 2004, COLING.

[11]  Hermann Ney,et al.  Do We Need Chinese Word Segmentation for Statistical Machine Translation? , 2004, SIGHAN@ACL.

[12]  M. de Rijke,et al.  Monolingual Document Retrieval for European Languages , 2004, Information Retrieval.

[13]  Fei Xia,et al.  The Penn Chinese TreeBank: Phrase structure annotation of a large corpus , 2005, Natural Language Engineering.

[14]  Changning Huang,et al.  Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach , 2005, CL.

[15]  Daniel Jurafsky,et al.  A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005 , 2005, IJCNLP.

[16]  William W. Cohen,et al.  NER Systems that Suit User’s Preferences: Adjusting the Recall-Precision Trade-off for Entity Extraction , 2006, NAACL.

[17]  Gina-Anne Levow,et al.  The Third International Chinese Language Processing Bakeoff: Word Segmentation and Named Entity Recognition , 2006, SIGHAN@COLING/ACL.

[18]  Nizar Habash,et al.  Permission is granted to quote short excerpts and to reproduce figures and tables from this report, provided that the source of such material is fully acknowledged. Arabic Preprocessing Schemes for Statistical Machine Translation , 2006 .

[19]  Mengqiu Wang,et al.  A Dual-layer CRFs Based Joint Decoding Method for Cascaded Segmentation and Labeling Tasks , 2007, IJCAI.

[20]  Wolfgang Macherey,et al.  Lattice-based Minimum Error Rate Training for Statistical Machine Translation , 2008, EMNLP.