Integrated Chinese Word Segmentation in Statistical Machine Translation

A Chinese sentence is written as a sequence of characters, with no delimiters between words. In statistical machine translation, the conventional approach is to segment the Chinese character sequence into words during pre-processing and to perform training and translation afterwards. This approach is suboptimal for two reasons: (1) the segmentation may contain errors, and (2) for a given character sequence, the best segmentation depends on its context and its translation. To minimize translation errors, we consider multiple segmentation alternatives instead of a single segmentation and integrate the segmentation process into the search for the best translation. The segmentation decision is made only while the translation is being generated. With this method we are also able to translate Chinese text at the character level. Experiments on the IWSLT 2005 task show improvements in translation performance for two translation systems: a phrase-based system and a system based on finite-state transducers. For the phrase-based system, the BLEU score improves by 1.5% absolute.
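The idea of keeping several segmentation alternatives open can be illustrated with a small lattice over character positions. The following is a minimal sketch, not the implementation used in the paper: the function names `build_lattice` and `segmentations`, the toy lexicon, and the `max_len` limit are all assumptions made for illustration. Single characters are always admitted as edges, which mirrors the character-level fallback mentioned above.

```python
from typing import Dict, List, Set, Tuple

Lattice = Dict[int, List[Tuple[int, str]]]


def build_lattice(chars: str, lexicon: Set[str], max_len: int = 4) -> Lattice:
    """Map each start position to the (end, word) edges that leave it.

    Hypothetical helper: an edge is added for every lexicon word and for
    every single character, so at least one path always exists.
    """
    lattice: Lattice = {i: [] for i in range(len(chars))}
    for i in range(len(chars)):
        for j in range(i + 1, min(len(chars), i + max_len) + 1):
            word = chars[i:j]
            if j == i + 1 or word in lexicon:
                lattice[i].append((j, word))
    return lattice


def segmentations(chars: str, lattice: Lattice) -> List[List[str]]:
    """Enumerate every path through the lattice, i.e. every segmentation."""
    def walk(pos: int) -> List[List[str]]:
        if pos == len(chars):
            return [[]]  # one empty continuation at the end of the sentence
        return [[word] + rest
                for end, word in lattice[pos]
                for rest in walk(end)]
    return walk(0)


# Toy lexicon (an assumption for this example only).
toy_lexicon = {"北京", "大学", "北京大学"}
sentence = "北京大学"
for seg in segmentations(sentence, build_lattice(sentence, toy_lexicon)):
    print(" ".join(seg))
```

In an integrated system the decoder would search over such a lattice rather than enumerate all paths, letting the translation and language model scores decide which segmentation survives. Enumeration is used here only because it makes the alternatives visible for a short input.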
