Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas

This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 Indigenous languages. In total, 8 teams participated with 214 submissions. We provided training sets consisting of data collected from various sources, as well as manually translated sentences for the development and test sets. We also provided an official baseline trained on this data. Team submissions featured a variety of architectures, including both statistical and neural models, and for the majority of languages, many teams improved considerably over the baseline. Averaged across languages, the best-performing systems scored 12.97 ChrF points higher than the baseline.
