Neutralizing the Effect of Translation Shifts on Automatic Machine Translation Evaluation

State-of-the-art automatic Machine Translation (MT) evaluation is based on the idea that the closer MT output is to Human Translation (HT), the higher its quality. Thus, automatic evaluation is typically approached by measuring some form of similarity between machine and human translations. The most widely used evaluation systems calculate similarity at the surface level, for example, by computing the number of shared word n-grams. The correlation between automatic and manual evaluation scores at the sentence level is still not satisfactory. One of the main reasons is that metrics under-score acceptable candidate translations because they cannot handle the lexical and syntactic variation between possible translation options. Acceptable differences between candidate and reference translations are frequently due to optional translation shifts: it is common practice in HT to paraphrase what could be viewed as a close version of the source text in order to adapt it to target-language usage. When a reference translation contains such changes, using it as the only point of comparison is less informative, as the differences are not indicative of MT errors. To alleviate this problem, we design a paraphrase generation system based on a set of rules that model prototypical optional shifts that human translators may have applied. By applying these rules to the available human reference, the system generates additional references in a principled and controlled way. We show that using linguistic rules to generate additional references neutralizes the negative effect of optional translation shifts on n-gram-based MT evaluation.
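To make the idea of rule-generated references concrete, the following is a minimal sketch of applying shift rules to a human reference to obtain additional references. The two string-level rules (adverb fronting and a rough genitive alternation), the helper `generate_references`, and the example sentence are illustrative assumptions; they stand in for the linguistically motivated shift rules described above and are not the paper's actual rule set.

```python
import re

# Toy, string-level rules, each modelling one optional translation shift.
# These regex patterns are illustrative assumptions only.
SHIFT_RULES = [
    # Fronted temporal adverb -> sentence-final position.
    (re.compile(r"^(Yesterday|Today|Tomorrow),?\s+(.*)$"),
     lambda m: f"{m.group(2)} {m.group(1).lower()}"),
    # "the X of the Y" -> "the Y's X" (very rough genitive alternation).
    (re.compile(r"\bthe (\w+) of the (\w+)\b"),
     lambda m: f"the {m.group(2)}'s {m.group(1)}"),
]

def generate_references(reference):
    """Apply each shift rule to the human reference once and collect every
    distinct paraphrase as an additional reference."""
    variants = {reference}
    for pattern, rewrite in SHIFT_RULES:
        if pattern.search(reference):
            variants.add(pattern.sub(rewrite, reference, count=1))
    return sorted(variants)

for ref in generate_references("Yesterday the board approved the budget of the museum"):
    print(ref)
```

Each rule fires independently on the original reference, so the reference set grows in a controlled way: one new reference per applicable shift, rather than an unconstrained space of paraphrases.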

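The effect of the enlarged reference set on n-gram-based evaluation can be illustrated with a small BLEU-style precision computation. The function `clipped_ngram_precision` and the example sentences below are hypothetical; the sketch only shows how an additional reference that undoes an optional shift (here, passivisation) lets an acceptable candidate share more n-grams with the reference set.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of word n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_ngram_precision(candidate, references, n=2):
    """Fraction of candidate n-grams matched by the reference set, with counts
    clipped to the maximum count seen in any single reference (BLEU-style)."""
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref.lower().split(), n).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand.items())
    return matched / sum(cand.values())

candidate = "the committee rejected the proposal yesterday"
original_ref = "yesterday the proposal was rejected by the committee"
# Hypothetical extra reference produced by undoing an optional shift
# (passivisation) in the original reference.
extra_ref = "yesterday the committee rejected the proposal"

print(clipped_ngram_precision(candidate, [original_ref]))             # 0.4
print(clipped_ngram_precision(candidate, [original_ref, extra_ref]))  # 0.8
```

The candidate's bigram precision doubles once the shift-derived reference is added, even though neither the candidate nor the original reference has changed, which is exactly the kind of penalty for acceptable variation that the additional references are meant to neutralize.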