Results of WMT22 Metrics Shared Task: Stop Using BLEU – Neural Metrics Are Better and More Robust
Markus Freitag | Ricardo Rei | Nitika Mathur | Chi-kiu (羅致翹) Lo | Craig Alan Stewart | Eleftherios Avramidis | Tom Kocmi | George F. Foster | Alon Lavie | André F. T. Martins
[1] Liane Guillou, et al. ACES: Translation Accuracy Challenge Sets for Evaluating Machine Translation Metrics, 2022, WMT.
[2] Lidia S. Chao, et al. Alibaba-Translate China’s Submission for WMT 2022 Metrics Shared Task, 2022, WMT.
[3] William Yang Wang, et al. Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis, 2022, EMNLP.
[4] Shannon L. Spruit, et al. No Language Left Behind: Scaling Human-Centered Machine Translation, 2022, ArXiv.
[5] José G. C. de Souza, et al. Quality-Aware Decoding for Neural Machine Translation, 2022, NAACL.
[6] Lidia S. Chao, et al. UniTE: Unified Translation Evaluation, 2022, ACL.
[7] S. Clémençon, et al. What are the best systems? New perspectives on NLP Benchmarking, 2022, ArXiv.
[8] David Grangier, et al. High Quality Rather than High Model Probability: Minimum Bayes Risk Decoding with Neural Metrics, 2021, TACL.
[9] Marc'Aurelio Ranzato, et al. The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation, 2021, TACL.
[10] Alessandro Sciré, et al. MaTESe: Machine Translation Evaluation as a Sequence Tagging Problem, 2022, WMT.
[11] Pengfei Li, et al. Partial Could Be Better than Whole. HW-TSC 2022 Submission for the Metrics Shared Task, 2022, WMT.
[12] Alon Lavie, et al. COMET-22: Unbabel-IST 2022 Submission for the Metrics Shared Task, 2022, WMT.
[13] Manish Shrivastava, et al. REUSE: REference-free UnSupervised Quality Estimation Metric, 2022, WMT.
[14] Manish Shrivastava, et al. Unsupervised Embedding-based Metric for MT Evaluation with Improved Human Correlation, 2022, WMT.
[15] Hao Yang, et al. Exploring Robustness of Machine Translation Metrics: A Study of Twenty-Two Automatic Metrics in the WMT22 Metric Task, 2022, WMT.
[16] Eleftherios Avramidis, et al. Linguistically Motivated Evaluation of Machine Translation Metrics Based on a Challenge Set, 2022, WMT.
[17] André F. T. Martins, et al. Robust MT Evaluation with Sentence-level Multilingual Augmentation, 2022, WMT.
[18] Philipp Koehn, et al. Findings of the 2022 Conference on Machine Translation (WMT22), 2022, WMT.
[19] Eleftherios Avramidis, et al. A Linguistically Motivated Test Suite to Semi-Automatically Evaluate German–English Machine Translation Output, 2022, LREC.
[20] Sebastian Gehrmann, et al. Learning Compact Metrics for MT, 2021, EMNLP.
[21] Marcin Junczys-Dowmunt, et al. To Ship or Not to Ship: An Extensive Evaluation of Automatic Metrics for Machine Translation, 2021, WMT.
[22] Rico Sennrich, et al. Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine Translation, 2021, ACL.
[23] Markus Freitag, et al. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation, 2021, TACL.
[24] Dan Roth, et al. A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods, 2021, TACL.
[25] Holger Schwenk, et al. Beyond English-Centric Multilingual Machine Translation, 2020, JMLR.
[26] Alon Lavie, et al. Are References Really Needed? Unbabel-IST 2021 Submission for the Metrics Shared Task, 2021, WMT.
[27] Alon Lavie, et al. Results of the WMT21 Metrics Shared Task: Evaluating Metrics with Expert-based Human Evaluations on TED and News Domain, 2021, WMT.
[28] Wilker Aziz, et al. Sampling-Based Minimum Bayes Risk Decoding for Neural Machine Translation, 2021, ArXiv.
[29] Vishrav Chaudhary, et al. Multilingual Translation from Denoising Pre-Training, 2021, Findings of ACL.
[30] Manish Shrivastava, et al. MEE: An Automatic Metric for Evaluation Using Embeddings for Machine Translation, 2020, IEEE DSAA.
[31] Alon Lavie, et al. COMET: A Neural Framework for MT Evaluation, 2020, EMNLP.
[32] Nitika Mathur, et al. Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics, 2020, ACL.
[33] Wilker Aziz, et al. Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation, 2020, COLING.
[34] Thibault Sellam, et al. BLEURT: Learning Robust Metrics for Text Generation, 2020, ACL.
[35] Yejin Choi, et al. The Curious Case of Neural Text Degeneration, 2019, ICLR.
[36] Kilian Q. Weinberger, et al. BERTScore: Evaluating Text Generation with BERT, 2019, ICLR.
[37] Chi-kiu Lo. YiSi - a Unified Semantic MT Quality Evaluation and Estimation Metric for Languages with Different Levels of Available Resources, 2019, WMT.
[38] Myle Ott, et al. Facebook FAIR’s WMT19 News Translation Task Submission, 2019, WMT.
[39] Taku Kudo, et al. SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing, 2018, EMNLP.
[40] Matt Post. A Call for Clarity in Reporting BLEU Scores, 2018, WMT.
[41] Mani B. Srivastava, et al. Generating Natural Language Adversarial Examples, 2018, EMNLP.
[42] Emily M. Bender, et al. Towards Linguistically Generalizable NLP Systems: A Workshop and Shared Task, 2017, Proceedings of the First Workshop on Building Linguistically Generalizable NLP Systems.
[43] Maja Popović. chrF: character n-gram F-score for automatic MT evaluation, 2015, WMT.
[44] Aljoscha Burchardt, et al. Multidimensional Quality Metrics (MQM): A Framework for Declaring and Describing Translation Quality Metrics, 2014.
[45] Timothy Baldwin, et al. Continuous Measurement Scales in Human Evaluation of Machine Translation, 2013, LAW@ACL.
[46] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.