Paper information - Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT, yet their behavior is not fully understood. In this paper, we analyze some flaws in the BLEU/NIST metrics. A better understanding of these problems allows reported BLEU/NIST scores to be interpreted more reliably. In addition, this paper reports a novel method of calculating confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether the difference between two MT systems is statistically significant.
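The abstract does not spell out the resampling procedure, so the following is only a minimal sketch of a generic percentile bootstrap over test-set segments, not the authors' implementation. The names `bootstrap_ci`, `unigram_precision`, and `precision_diff` are hypothetical, and clipped unigram precision is used as a simple stand-in for a corpus-level score such as BLEU or NIST. Comparing two systems by bootstrapping their score difference on the same resampled segments is one common approach and is assumed here for illustration.

```python
import random
from collections import Counter
from typing import Callable, Sequence, Tuple


def bootstrap_ci(
    segments: Sequence,
    corpus_metric: Callable[[Sequence], float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for a corpus-level metric (illustrative)."""
    rng = random.Random(seed)
    n = len(segments)
    scores = []
    for _ in range(n_resamples):
        # Draw n segments with replacement to form one pseudo test set.
        sample = [segments[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_metric(sample))
    scores.sort()
    # Read the interval bounds off the sorted resampled scores.
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


def unigram_precision(pairs: Sequence) -> float:
    """Clipped unigram precision over (hypothesis, reference) token lists.

    A stand-in metric (assumption): a real evaluation would use BLEU or NIST.
    """
    matched = total = 0
    for hyp, ref in pairs:
        ref_counts = Counter(ref)
        hyp_counts = Counter(hyp)
        matched += sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
        total += len(hyp)
    return matched / total if total else 0.0


def precision_diff(triples: Sequence) -> float:
    """Score difference (system A minus system B) on the same segments."""
    score_a = unigram_precision([(a, ref) for a, _, ref in triples])
    score_b = unigram_precision([(b, ref) for _, b, ref in triples])
    return score_a - score_b


# Toy data (invented for illustration): (system A, system B, reference).
toy = [
    ("the cat sat".split(), "a cat sits".split(), "the cat sat".split()),
    ("he eats fish".split(), "he eat fishes".split(), "he eats fish".split()),
    ("it rains now".split(), "rain it now".split(), "it is raining now".split()),
    ("good morning".split(), "good day".split(), "good morning".split()),
]

ci_a = bootstrap_ci([(a, ref) for a, _, ref in toy], unigram_precision)
ci_diff = bootstrap_ci(toy, precision_diff)
```

Under this sketch, the two systems would be judged significantly different at level `alpha` when the interval `ci_diff` excludes zero. With only four toy segments the intervals are very wide; a real test set would have hundreds or thousands of segments.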