Re-evaluating the Role of Bleu in Machine Translation Research

We argue that the machine translation community is overly reliant on the Bleu machine translation evaluation metric. We show that an improved Bleu score is neither necessary nor sufficient for achieving an actual improvement in translation quality, and give two significant counterexamples to Bleu’s correlation with human judgments of quality. This offers new potential for research that was previously deemed unpromising because it failed to improve Bleu scores.
