An improvement of translation quality with adding key-words in parallel corpus

In this paper, we propose a new approach to improving translation quality by adding the key-words of a sentence to the parallel corpus. The main idea is to find the key-words of sentences that the model cannot translate properly, and then add each key-word to the training corpus on a separate line, as if it were a sentence. In our experiments, we use two statistical machine translation (SMT) systems, a word-based one (ISI ReWrite) and a phrase-based one (Moses), together with a small parallel corpus (4,000 sentences) to test this assumption. The augmented corpus yields a better BLEU score than the original parallel text: an improvement of about 6% for the word-based system (ISI ReWrite) and about 4% for the phrase-based system (Moses). Finally, we build an English-Chinese parallel corpus of size 120,000 in this way.
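The augmentation step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `augment_corpus`, the `keyword_pairs` input, and the de-duplication of repeated pairs are all assumptions made for the example. The abstract does not say how key-words are selected or how their translations are obtained, so the sketch simply takes already-aligned key-word pairs as input.

```python
from typing import List, Sequence, Tuple


def augment_corpus(
    src_lines: Sequence[str],
    tgt_lines: Sequence[str],
    keyword_pairs: Sequence[Tuple[str, str]],
) -> Tuple[List[str], List[str]]:
    """Append each key-word pair to a parallel corpus as its own line.

    src_lines and tgt_lines are the two sides of the corpus, aligned by index.
    keyword_pairs is a hypothetical input: (source key-word, target key-word)
    pairs taken from sentences the baseline system translated poorly.
    """
    if len(src_lines) != len(tgt_lines):
        raise ValueError("source and target sides must have the same length")

    aug_src = list(src_lines)
    aug_tgt = list(tgt_lines)
    seen = set()  # assumption: each distinct pair is added only once

    for src_kw, tgt_kw in keyword_pairs:
        src_kw, tgt_kw = src_kw.strip(), tgt_kw.strip()
        if not src_kw or not tgt_kw or (src_kw, tgt_kw) in seen:
            continue
        seen.add((src_kw, tgt_kw))
        # The key-word becomes a one-entry "sentence" on both sides,
        # so the corpus stays line-aligned for the SMT training tools.
        aug_src.append(src_kw)
        aug_tgt.append(tgt_kw)

    return aug_src, aug_tgt


# Toy data for illustration only; not taken from the paper's corpus.
src = ["the committee approved the budget", "she opened the window"]
tgt = ["委员会 批准 了 预算", "她 打开 了 窗户"]
pairs = [("budget", "预算"), ("committee", "委员会"), ("budget", "预算")]

new_src, new_tgt = augment_corpus(src, tgt, pairs)
```

The two returned lists can then be written out one entry per line and passed to the training pipeline in place of the original corpus files. Keeping both sides the same length is the important property, since word-based and phrase-based trainers both rely on line-level alignment.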
