Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

Abstract

Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of the evaluated systems from the one established by the WMT crowd workers, with a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
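To make the MQM-style scoring concrete, the sketch below shows one way per-segment and per-system scores could be computed from span-level error annotations by summing severity-weighted penalties. The severity weights, category names, and function names are illustrative assumptions for this sketch only; they are not the exact weighting scheme defined in the study.

from statistics import mean

# Illustrative severity weights (an assumption for this sketch; the study
# defines its own MQM weighting, which may differ).
SEVERITY_WEIGHTS = {
    "minor": 1.0,
    "major": 5.0,
}

def mqm_segment_score(errors):
    """Sum severity-weighted penalties for one translated segment.

    `errors` is a list of (category, severity) annotations, e.g. as marked
    by a professional translator; lower scores indicate better output.
    """
    return sum(SEVERITY_WEIGHTS.get(severity, 0.0) for _, severity in errors)

def mqm_system_score(segment_annotations):
    """Average per-segment penalty over all annotated segments of a system."""
    if not segment_annotations:
        return 0.0
    return mean(mqm_segment_score(errors) for errors in segment_annotations)

# Usage with two hypothetical segments annotated as (category, severity) pairs.
system_annotations = [
    [("accuracy/mistranslation", "major"), ("fluency/grammar", "minor")],
    [],  # a segment with no errors contributes zero penalty
]
print(mqm_system_score(system_annotations))  # -> 3.0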
