Meta-Evaluation of a Diagnostic Quality Metric for Machine Translation

Diagnostic evaluation of machine translation (MT) provides finer-grained information than state-of-the-art automatic metrics. This paper evaluates DELiC4MT, a diagnostic metric that assesses the performance of MT systems on user-defined linguistic phenomena. We present the results obtained with this metric when evaluating three MT systems translating from English to French, comparing its scores against both human judgements and a set of representative automatic evaluation metrics. In addition, as the diagnostic metric relies on word alignments, we measure the margin of error incurred in diagnostic evaluation when using automatic word alignments instead of gold-standard manual alignments. We observe that this diagnostic metric accurately reflects translation quality, can be used reliably with automatic word alignments and, in general, correlates well with automatic metrics and, more importantly, with human judgements.
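To make the evaluation scheme concrete, the sketch below illustrates how an alignment-based, check-point-style diagnostic score can be computed. It is a minimal illustration under assumed inputs, not the actual DELiC4MT implementation: for each instance of a user-defined linguistic phenomenon in the source, the source-reference word alignment is used to collect the reference tokens that realise the phenomenon, and the MT output is scored by how many of those tokens it reproduces (here, a simple clipped unigram recall). The function name `checkpoint_score` and the input structures are hypothetical.

```python
# Minimal sketch of a check-point-style diagnostic score (hypothetical,
# not the DELiC4MT implementation).
from collections import Counter

def checkpoint_score(instances, alignments, references, hypotheses):
    """
    instances   : per sentence, a list of check-point instances, each a set of
                  source token indices matching the linguistic phenomenon
    alignments  : per sentence, a set of (source_idx, reference_idx) pairs
    references  : per sentence, the tokenised reference translation
    hypotheses  : per sentence, the tokenised MT output
    Returns the fraction of aligned reference tokens recovered by the MT output.
    """
    matched, total = 0, 0
    for sent_instances, align, ref, hyp in zip(instances, alignments,
                                               references, hypotheses):
        hyp_counts = Counter(hyp)
        for instance in sent_instances:
            # Reference tokens aligned to the source tokens of this instance.
            ref_tokens = [ref[j] for (i, j) in align if i in instance]
            for tok in ref_tokens:
                total += 1
                if hyp_counts[tok] > 0:
                    matched += 1
                    hyp_counts[tok] -= 1  # consume the match (clipping)
    return matched / total if total else 0.0

# Toy example: source "the red car" -> reference "la voiture rouge";
# one noun-phrase check-point covering source tokens 1-2 ("red car").
instances  = [[{1, 2}]]
alignments = [{(0, 0), (1, 2), (2, 1)}]
references = [["la", "voiture", "rouge"]]
hypotheses = [["la", "auto", "rouge"]]
print(checkpoint_score(instances, alignments, references, hypotheses))  # 0.5
```

With noisy automatic alignments, some reference tokens would be attached to the wrong check-point instances, which is why the paper quantifies the margin of error between scores computed over automatic and gold-standard alignments.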
