Meteor, M-BLEU and M-TER: Evaluation Metrics for High-Correlation with Human Rankings of Machine Translation Output

This paper describes our submissions to the machine translation evaluation shared task in ACL WMT-08. Our primary submission is the Meteor metric tuned to optimize correlation with human rankings of translation hypotheses. We show significant improvement in correlation over the earlier version of the metric, which was tuned to optimize correlation with traditional adequacy and fluency judgments. We also describe m-bleu and m-ter, enhanced versions of two other widely used metrics, bleu and ter, which extend the exact word matching used in those metrics with the flexible matching, based on stemming and WordNet, used in Meteor.
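To make the flexible matching concrete, the sketch below aligns hypothesis and reference unigrams in three stages (exact, stemmed, WordNet synonym). It assumes NLTK's Porter stemmer and WordNet interface as convenient stand-ins for Meteor's internal matching modules, so it approximates rather than reproduces the released system.

# A minimal sketch of staged flexible matching, assuming NLTK's Porter stemmer
# and WordNet interface; illustrative code, not the released Meteor system.
from nltk.stem import PorterStemmer        # pip install nltk
from nltk.corpus import wordnet            # also run nltk.download('wordnet') once

_stemmer = PorterStemmer()

def _synonyms(word):
    # All WordNet lemma names reachable from any synset of the word.
    return {lemma.lower()
            for synset in wordnet.synsets(word)
            for lemma in synset.lemma_names()}

def _matches(h, r, stage):
    # Test one hypothesis/reference token pair under a single matching stage.
    if stage == "exact":
        return h == r
    if stage == "stem":
        return _stemmer.stem(h) == _stemmer.stem(r)
    return h in _synonyms(r) or r in _synonyms(h)   # "synonym" stage

def align(hypothesis, reference):
    # Greedy staged alignment: exact matches first, then stem matches,
    # then WordNet synonym matches, each token used at most once.
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    used_hyp, used_ref, alignment = set(), set(), []
    for stage in ("exact", "stem", "synonym"):
        for i, h in enumerate(hyp):
            if i in used_hyp:
                continue
            for j, r in enumerate(ref):
                if j not in used_ref and _matches(h, r, stage):
                    alignment.append((i, j))
                    used_hyp.add(i)
                    used_ref.add(j)
                    break
    return alignment

In m-bleu and m-ter, matches of this kind stand in for the exact-match test inside the underlying metrics' n-gram counting and edit-distance computations, as described in the abstract.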
