Metric Score Landscape Challenge (MSLC23): Understanding Metrics’ Performance on a Wider Landscape of Translation Quality

The Metric Score Landscape Challenge (MSLC23) dataset aims to provide insight into metric scores across a wider landscape of machine translation (MT) quality. It provides a collection of low- to medium-quality MT output on the WMT23 general task test set. Together with the high-quality systems submitted to the general task, this enables better interpretation of metric scores across a wide range of translation quality levels. With this wider range of MT quality, we also visualize and analyze metric characteristics beyond correlation.
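Below is a minimal, hypothetical sketch (not taken from the paper) of the kind of analysis such a dataset enables: scoring low-, medium-, and high-quality system outputs with an automatic metric and inspecting the full distribution of segment-level scores rather than only a correlation statistic. The file paths, system names, and choice of chrF via sacrebleu are assumptions for illustration.

```python
# Hypothetical sketch: compare segment-level metric score distributions
# across MT systems of widely varying quality. Paths and system labels
# are placeholders, not part of the MSLC23 release.
import sacrebleu
import matplotlib.pyplot as plt


def score_system(hyp_path: str, ref_path: str) -> list[float]:
    """Return per-segment chrF scores for one system's output."""
    with open(hyp_path, encoding="utf-8") as h, open(ref_path, encoding="utf-8") as r:
        hyps = [line.strip() for line in h]
        refs = [line.strip() for line in r]
    return [sacrebleu.sentence_chrf(hyp, [ref]).score for hyp, ref in zip(hyps, refs)]


# Assumed outputs spanning a wide quality range (low/medium quality plus a
# strong WMT23 general-task submission).
systems = {
    "low-quality": "outputs/low_quality.txt",
    "medium-quality": "outputs/medium_quality.txt",
    "wmt23-submission": "outputs/high_quality.txt",
}

for name, path in systems.items():
    scores = score_system(path, "refs/wmt23_test.ref")
    plt.hist(scores, bins=40, alpha=0.5, label=name)

plt.xlabel("segment-level chrF")
plt.ylabel("number of segments")
plt.legend()
plt.title("Metric score landscape across translation quality levels")
plt.show()
```

The same scoring loop can be repeated with other metrics (e.g., BLEU or a learned metric) to compare how their score distributions behave on low- versus high-quality output.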
