Findings of the WMT 2022 Shared Task on Chat Translation

This paper reports the findings of the second edition of the Chat Translation Shared Task. As in the previous WMT 2020 edition, the task consisted of translating bilingual customer support conversational text. However, unlike the previous edition, in which the bilingual data was created from a synthetic monolingual English corpus, this year we used a portion of Unbabel’s newly released MAIA corpus, which contains genuine bilingual conversations between agents and customers. We also expanded the language pairs to English↔German (en↔de), English↔French (en↔fr), and English↔Brazilian Portuguese (en↔pt-br). Given that the main goal of the shared task is to translate bilingual conversations, participants were encouraged to train and test their models specifically for this setting. In total, we received 18 submissions from 4 different teams. All teams participated in both directions of en↔de. One of the teams also participated in en↔fr and en↔pt-br. We evaluated the submissions with automatic metrics as well as human judgments via Multidimensional Quality Metrics (MQM) in both directions. The official ranking of the systems is based on their overall MQM scores in both directions, i.e., agent and customer.
