A Structured Review of the Validity of BLEU

The BLEU metric has been widely used for over 15 years to evaluate NLP systems, especially in machine translation (MT) and natural language generation. I present a structured review of the evidence on whether BLEU is a valid evaluation technique, that is, whether BLEU scores correlate with the real-world utility and user satisfaction of NLP systems. The review covers 284 correlations reported in 34 papers. Overall, the evidence supports using BLEU for diagnostic evaluation of MT systems (the purpose for which it was originally proposed), but does not support using BLEU outside of MT, for evaluation of individual texts, or for scientific hypothesis testing.
