Contrastive Lexical Evaluation of Machine Translation

This paper advocates a complementary measure of translation performance that focuses on the contrastive ability of two or more systems or system versions to adequately translate source words. This is motivated by three main reasons: 1) existing automatic metrics sometimes fail to show significant differences that can be revealed by fine-grained, focussed human evaluation; 2) these metrics are based on direct comparisons of system hypotheses with the corresponding reference translations, thus ignoring the input words that were actually translated; and 3) as these metrics score the hypotheses of each system separately, fine-grained contrastive evaluation can only be done indirectly. This proposal is illustrated in a multi-source Machine Translation scenario where multiple translations of a source text are available. Significant gains (up to +1.3 BLEU points) are achieved in these experiments, and contrastive lexical evaluation is shown to provide new information that can help to better analyse a system's performance.
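
To make the idea concrete, below is a minimal sketch, not the paper's actual procedure, of how such a per-word contrastive tally could be computed for two systems. It assumes a hypothetical bilingual lexicon mapping each source word to a set of acceptable translations; in practice, acceptable translations would more likely be derived from the reference via word alignment.

```python
from collections import Counter

def translates(word, hyp_tokens, lexicon):
    """Return True if any acceptable translation of `word` appears in the hypothesis."""
    return any(t in hyp_tokens for t in lexicon.get(word, ()))

def contrastive_lexical_eval(sources, hyps_a, hyps_b, lexicon):
    """Tally, for each source word, which system(s) produced an acceptable translation.

    Keys of the returned Counter are (ok_a, ok_b) pairs, e.g. (True, False)
    counts source words adequately translated by system A only.
    """
    outcomes = Counter()
    for src, hyp_a, hyp_b in zip(sources, hyps_a, hyps_b):
        tokens_a, tokens_b = set(hyp_a.split()), set(hyp_b.split())
        for word in src.split():
            outcomes[(translates(word, tokens_a, lexicon),
                      translates(word, tokens_b, lexicon))] += 1
    return outcomes

# Toy usage with an illustrative, made-up lexicon entry:
lexicon = {"maison": {"house", "home"}}
stats = contrastive_lexical_eval(["la maison"], ["the house"], ["the building"], lexicon)
print(stats[(True, False)])  # source words only system A translated adequately
```

The off-diagonal cells of this contingency table, words one system gets right and the other misses, are precisely the contrasts that corpus-level metrics such as BLEU average away.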
