Results of the WMT20 Metrics Shared Task

This paper presents the results of the WMT20 Metrics Shared Task. Participants were asked to score the outputs of the translation systems competing in the WMT20 News Translation Task with automatic metrics. Ten research groups submitted 27 metrics, four of which are reference-less “metrics”. In addition, we computed five baseline metrics, including SENTBLEU, BLEU, TER, and CHRF, using the SacreBLEU scorer. All metrics were evaluated on how well they correlate at the system-, document-, and segment-level with the WMT20 official human scores. We present an extensive analysis of the influence of reference translations on metric reliability and of how well automatic metrics score human translations, and we flag major discrepancies between metric and human scores when evaluating MT systems. Finally, we investigate whether automatic metrics can be used to flag incorrect human ratings.
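
To make the evaluation setup concrete, the sketch below shows how baseline scores of this kind can be computed with the SacreBLEU scorer and then correlated with human judgments. It is a minimal illustration, not the task's official scoring pipeline: the hypothesis, reference, and score values (`hypotheses`, `references`, `metric_scores`, `human_scores`) are placeholder data, and Pearson's r stands in for the system-level correlation analysis described above.

```python
# Minimal sketch: baseline metric scoring with SacreBLEU plus a
# system-level Pearson correlation. All data below are placeholders.
import sacrebleu
from scipy.stats import pearsonr

# Hypothetical MT outputs and a single reference stream.
hypotheses = [
    "The cat sat on the mat.",
    "A quick brown fox jumps over the dog.",
]
references = [[
    "The cat sat on the mat.",
    "The quick brown fox jumps over the lazy dog.",
]]  # SacreBLEU expects one list of references per reference stream.

# Corpus-level baselines (system-level scores).
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
ter = sacrebleu.corpus_ter(hypotheses, references)
print(f"BLEU={bleu.score:.2f} CHRF={chrf.score:.2f} TER={ter.score:.2f}")

# Segment-level SENTBLEU: one score per hypothesis/reference pair.
sent_scores = [
    sacrebleu.sentence_bleu(hyp, [ref]).score
    for hyp, ref in zip(hypotheses, references[0])
]

# System-level meta-evaluation correlates one metric score per MT system
# with the corresponding human score; illustrated with made-up numbers.
metric_scores = [32.1, 28.7, 35.4]  # hypothetical: one score per system
human_scores = [0.12, -0.05, 0.31]  # hypothetical: human z-scores
r, p = pearsonr(metric_scores, human_scores)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```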
