Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust

This paper presents the results of the WMT22 Metrics Shared Task. Participants submitting automatic MT evaluation metrics were asked to score the outputs of the translation systems competing in the WMT22 News Translation Task on four different domains: news, social, ecommerce, and chat. All metrics were evaluated on how well they correlate with human ratings at the system and segment level. Similar to last year, we acquired our own human ratings based on expert-based human evaluation via Multidimensional Quality Metrics (MQM). This setup had several advantages: (i) expert-based evaluation is more reliable, and (ii) we extended the pool of translations with five additional translations based on MBR decoding or rescoring, which are challenging for current metrics. In addition, we initiated a challenge set subtask, in which participants had to create contrastive test suites for evaluating metrics’ ability to capture and penalise specific types of translation errors. Finally, we present an extensive analysis of how well metrics perform on three language pairs: English to German, English to Russian, and Chinese to English. The results demonstrate the superiority of neural-based learned metrics and show once again that overlap metrics such as BLEU, spBLEU, and chrF correlate poorly with human ratings. The results also reveal that neural-based metrics are remarkably robust across different domains and challenges.
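To make the evaluation protocol concrete, below is a minimal sketch (not the official WMT22 scoring code) of how agreement between a metric and MQM ratings might be computed at the two granularities the task uses: Pearson correlation over per-system averages at the system level, and Kendall's tau over individual segment scores at the segment level. The score arrays and the assumption that higher means better for both metric and human scores are placeholders for illustration.

```python
# Sketch of system- and segment-level meta-evaluation of an MT metric.
# Hypothetical inputs: metric_scores and mqm_scores, both shaped
# (n_systems, n_segments); here they are filled with synthetic data.
import numpy as np
from scipy.stats import pearsonr, kendalltau

rng = np.random.default_rng(0)
n_systems, n_segments = 10, 500
mqm_scores = rng.normal(size=(n_systems, n_segments))                       # placeholder human ratings
metric_scores = mqm_scores + rng.normal(scale=0.5, size=mqm_scores.shape)   # placeholder metric scores

# System level: correlate per-system averages of metric and human scores.
system_pearson, _ = pearsonr(metric_scores.mean(axis=1), mqm_scores.mean(axis=1))

# Segment level: rank agreement over all individual segment scores.
segment_tau, _ = kendalltau(metric_scores.ravel(), mqm_scores.ravel())

print(f"system-level Pearson r:    {system_pearson:.3f}")
print(f"segment-level Kendall tau: {segment_tau:.3f}")
```

The shared task reports additional statistics (e.g. pairwise accuracy and significance clusters), but the two correlations above capture the basic system- and segment-level setup described in the abstract.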
