An Empirical Study on Word Segmentation for Chinese Machine Translation

Word segmentation has been shown to be helpful for Chinese-to-English machine translation (MT), yet how different segmentation strategies affect MT is poorly understood. In this paper, we compare segmentation strategies in terms of machine translation quality. Our empirical study is the first to cover both English-to-Chinese and Chinese-to-English translation. Our results show that the necessity of word segmentation depends on the translation direction. After comparing two types of segmentation strategies with their associated linguistic resources, we demonstrate that optimizing segmentation itself does not guarantee better MT performance, and that the choice of segmentation strategy is not the key to improving MT. Instead, we find that linguistic resources, such as segmented corpora or the dictionaries that segmentation tools rely on, actually determine how word segmentation affects machine translation. Based on these findings, we propose an empirical approach that directly optimizes the word segmenter's dictionary with respect to the MT task, providing a BLEU score improvement of 1.30.
