The Flores-101 Evaluation Benchmark for Low-Resource and Multilingual Machine Translation

One of the biggest challenges hindering progress in low-resource and multilingual machine translation is the lack of good evaluation benchmarks. Current benchmarks either lack coverage of low-resource languages, consider only restricted domains, or are low quality because they are constructed using semi-automatic procedures. In this work, we introduce the FLORES-101 evaluation benchmark, consisting of 3001 sentences extracted from English Wikipedia and covering a variety of topics and domains. These sentences have been translated into 101 languages by professional translators through a carefully controlled process. The resulting dataset enables better assessment of model quality on the long tail of low-resource languages, including the evaluation of many-to-many multilingual translation systems, as all translations are multilingually aligned. By publicly releasing such a high-quality, high-coverage dataset, we hope to foster progress in the machine translation community and beyond.
