Paper information - Improving Chinese-English machine translation through better source-side linguistic processing

Improving Chinese-English machine translation through better source-side linguistic processing

Machine Translation (MT) is a task with multiple components, each of which can be very challenging. This thesis focuses on a difficult language pair, Chinese to English, and addresses several language-specific aspects that make translation more difficult.

The first challenge is the difference in writing systems. Chinese has no explicit boundaries between words, and even the definition of a "word" is unclear. We build a general-purpose Chinese word segmenter with linguistically inspired features that performs very well on the SIGHAN 2005 bakeoff data. We then study how Chinese word segmenter performance relates to MT performance, and provide a way to tune the "word" unit in Chinese so that it better matches English word granularity and therefore improves MT performance.

The second challenge is the difference in word order between Chinese and English. We first perform error analysis on three state-of-the-art MT systems to identify the most prominent problems, especially how different word orders cause translation errors. Based on our findings, we propose two solutions to improve Chinese-to-English MT systems.

First, word reordering, especially over longer distances, causes many errors. Even though Chinese and English are both Subject-Verb-Object (SVO) languages, they usually use different word orders in noun phrases, prepositional phrases, etc. Many of these differences span long distances and cause difficulty for MT systems. There have been many previous studies on this. In this thesis, we introduce a richer set of Chinese grammatical relations that describes more semantically abstract relations between words. We integrate these Chinese grammatical relations into the most widely used state-of-the-art phrase-based MT system and improve its performance.

Second, we study the behavior of the most common Chinese word (DE), which does not have a direct mapping to English. DE serves different functions in Chinese and can therefore be ambiguous when translated into English. It may also cause longer-distance reordering. We propose a classifier to disambiguate DEs in Chinese text. Using this classifier, we improve English translation quality because we can make the Chinese word order much more similar to English, and we can also determine which construction a DE should be translated into (e.g., relative clause, prepositional phrase, etc.).

Christopher D. Manning | Pi-Chuan Chang
