Paper Information - The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current evaluation benchmarks either lack good coverage of low-resource languages, consider only restricted domains, or are of low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality and high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
