Paper information - Statistical Approaches to Patent Translation for PatentMT - Experiments with Various Settings of Training Data

This paper describes our experiments and results in the NTCIR-9 Chinese-to-English Patent Translation Task. Several open-source software packages were integrated to build a statistical machine translation model for the task. Starting from this model, we tried various Chinese segmentation methods, additional resources, and training-corpus preprocessing steps. In total, more than 20 experiments were conducted to compare translation performance. Our current results show that 1) consistent segmentation between the training and testing data is important for maintaining performance; 2) a sufficient number of good-quality bilingual training sentences is more helpful than additional bilingual dictionaries; and 3) translation effectiveness, measured in BLEU, doubles when the number of bilingual training sentences doubles at the scale of 100,000 sentences.
