Metric Score Landscape Challenge (MSLC23): Understanding Metrics’ Performance on a Wider Landscape of Translation Quality