Paper information - Measuring confidence intervals for the machine translation evaluation metrics

Automatic evaluation metrics for Machine Translation (MT) systems, such as BLEU and the related NIST metric, are becoming increasingly important in MT. This paper reports a novel method of calculating the confidence intervals for BLEU/NIST scores using bootstrapping. With this method, we can determine whether two MT systems are significantly different from each other. We study the effect of test set size and number of reference translations on the confidence intervals for these MT evaluation metrics.
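The bootstrapping idea in the abstract can be illustrated with a minimal sketch. It assumes a percentile bootstrap over test-set segments: segments are resampled with replacement, the corpus-level score is recomputed on each resample, and the interval is read off the sorted scores. The names `corpus_precision`, `bootstrap_ci`, and `bootstrap_diff_ci` are hypothetical, and the toy n-gram precision is only a stand-in for BLEU or NIST; this is not necessarily the exact procedure used in the paper.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Per-segment sufficient statistics: (matched n-grams, hypothesis n-grams).
# Assumption: a toy stand-in for the statistics a real BLEU/NIST scorer keeps.
Stat = Tuple[int, int]


def corpus_precision(stats: Sequence[Stat]) -> float:
    """Toy corpus-level metric: pooled n-gram precision over all segments."""
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total if total else 0.0


def _percentile_interval(scores: List[float], alpha: float) -> Tuple[float, float]:
    """Read an approximate (1 - alpha) percentile interval off the scores."""
    scores = sorted(scores)
    b = len(scores)
    lo = scores[int((alpha / 2) * b)]
    hi = scores[min(b - 1, int((1 - alpha / 2) * b))]
    return lo, hi


def bootstrap_ci(
    stats: Sequence[Stat],
    score_fn: Callable[[Sequence[Stat]], float] = corpus_precision,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for a corpus-level score."""
    rng = random.Random(seed)
    n = len(stats)
    scores = []
    for _ in range(n_resamples):
        # Draw n segments with replacement, then rescore the whole resample.
        sample = [stats[rng.randrange(n)] for _ in range(n)]
        scores.append(score_fn(sample))
    return _percentile_interval(scores, alpha)


def bootstrap_diff_ci(
    stats_a: Sequence[Stat],
    stats_b: Sequence[Stat],
    score_fn: Callable[[Sequence[Stat]], float] = corpus_precision,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Interval for score(A) - score(B), resampling the same segments for both."""
    assert len(stats_a) == len(stats_b), "systems must share one test set"
    rng = random.Random(seed)
    n = len(stats_a)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        a = score_fn([stats_a[i] for i in idx])
        b = score_fn([stats_b[i] for i in idx])
        diffs.append(a - b)
    return _percentile_interval(diffs, alpha)
```

Under these assumptions, two systems would be called significantly different at the chosen level when the interval returned by `bootstrap_diff_ci` excludes zero. Because each resample draws as many segments as the test set contains, a larger test set generally yields a narrower interval, which is the kind of effect the abstract says the paper studies.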

Ying Zhang | Stephan Vogel

[1] I. Dan Melamed, et al. Precision and Recall of Machine Translation, 2003, NAACL.

[2] Christopher Culy, et al. The limits of n-gram translation evaluation metrics, 2003, MTSUMMIT.

[3] Ying Zhang, et al. Interpreting BLEU/NIST Scores: How Much Improvement do We Need to Have a Better System?, 2004, LREC.

[4] H. Ney, et al. A novel string-to-string distance measure with applications to machine translation evaluation, 2003, MTSUMMIT.

[5] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[6] Hermann Ney, et al. An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research, 2000, LREC.

[7] Hermann Ney, et al. Bootstrap estimates for confidence intervals in ASR performance evaluation, 2004, 2004 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8] M. King, et al. FEMTI: creating and using a framework for MT evaluation, 2003, MTSUMMIT.