Metric and reference factors in minimum error rate training

In Minimum Error Rate Training (MERT), Bleu is often used as the error function, despite having been shown to correlate with human judgment less well than other metrics such as Meteor and Ter. In this paper, we present empirical results showing that parameters tuned on Bleu may lead to sub-optimal Bleu scores under certain data conditions. Such scores can be improved significantly by tuning on an entirely different metric, e.g. Meteor, yielding gains of 0.0082 Bleu (a 3.38% relative improvement) on the WMT08 English–French data. We analyze the influence of the number of references and the choice of metric on the outcome of MERT, experimenting on several data sets. We show the problems of tuning on a metric that is not designed for the single-reference scenario and point out some possible solutions.
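To make the tuning setup concrete, below is a minimal sketch of the MERT objective: model weights are chosen to optimize a corpus-level metric over fixed n-best lists, so swapping the `metric` callable (e.g. a Bleu scorer for a Meteor scorer) changes what is being tuned without changing the decoder. The scorer and data structures here are hypothetical placeholders, and the random-restart search is only a stand-in for Och's exact line search; this is a generic illustration, not the system used in the paper.

```python
import random

def mert_objective(weights, nbest_lists, references, metric):
    """Score the 1-best hypotheses selected by `weights` with `metric`.

    nbest_lists: per-sentence lists of (hypothesis, feature_vector) pairs.
    metric: corpus-level scorer such as Bleu or Meteor (higher is better).
    """
    selected = []
    for candidates in nbest_lists:
        # Pick the candidate with the highest linear model score.
        best_hyp, _ = max(
            candidates,
            key=lambda c: sum(w * f for w, f in zip(weights, c[1])),
        )
        selected.append(best_hyp)
    return metric(selected, references)

def random_restart_mert(nbest_lists, references, metric, dim, restarts=20):
    """Toy MERT driver: random search with restarts over the weights.

    Real implementations use Och's line search; random search is only a
    stand-in to show that the objective, not the optimizer, determines
    which metric the parameters are tuned toward.
    """
    best_weights, best_score = None, float("-inf")
    for _ in range(restarts):
        weights = [random.uniform(-1.0, 1.0) for _ in range(dim)]
        score = mert_objective(weights, nbest_lists, references, metric)
        if score > best_score:
            best_weights, best_score = weights, score
    return best_weights, best_score
```

In this framing, tuning on Meteor rather than Bleu amounts to passing a different `metric` callable; the paper's observation is that, under single-reference conditions, this substitution can improve even the Bleu score of the resulting system.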
