How much data is needed for reliable MT evaluation? Using bootstrapping to study human and automatic metrics

Evaluating the output quality of a machine translation system requires test data and quality metrics. Based on the results of the French MT evaluation campaign CESTA, this paper studies the statistical reliability of evaluation scores as a function of the amount of test data used to obtain them. Bootstrapping is used to compute the standard deviation of scores assigned by human judges (mainly adequacy judgments) as well as of five automatic metrics. The reliability of the scores is measured using two formal criteria, and the minimal number of documents or segments needed to reach reliable scores is estimated. This number does not depend on the exact subset of documents used.
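To make the bootstrapping procedure concrete, the following is a minimal sketch of segment-level resampling, assuming a toy corpus-level metric (the mean of per-segment scores) and synthetic adequacy judgments as stand-ins for the CESTA data; `toy_metric`, the 1-5 score scale, and the sample sizes are illustrative placeholders, not the paper's actual setup.

```python
import random
import statistics

def toy_metric(scores):
    """Hypothetical corpus-level metric: the mean of per-segment scores.
    Stands in for corpus-level metrics such as BLEU."""
    return sum(scores) / len(scores)

def bootstrap_std(scores, n_resamples=1000, rng=None):
    """Bootstrap estimate of the standard deviation of the corpus-level
    score: resample segments with replacement, re-score each resample,
    and take the standard deviation of the resulting scores."""
    rng = rng or random.Random(0)
    n = len(scores)
    resampled = []
    for _ in range(n_resamples):
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled.append(toy_metric(sample))
    return statistics.stdev(resampled)

if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical per-segment adequacy judgments on a 1-5 scale.
    all_scores = [rng.uniform(1, 5) for _ in range(1000)]
    # Reliability as a function of test-set size: the bootstrap standard
    # deviation should shrink as more segments are used.
    for size in (50, 100, 250, 500, 1000):
        subset = all_scores[:size]
        print(f"{size:5d} segments: std = {bootstrap_std(subset, rng=rng):.4f}")
```

Resampling at the segment level mirrors one of the two granularities studied in the paper; the same loop applies at the document level by resampling documents instead of individual segments.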
