Moses and the Character-Based Random Babbling Baseline: CoAStaL at AmericasNLP 2021 Shared Task

We evaluated a range of neural machine translation techniques developed specifically for low-resource scenarios. Unsuccessfully. In the end, we submitted two runs: (i) a standard phrase-based model, and (ii) a random babbling baseline using character trigrams. We found that (i) was surprisingly hard to beat, in spite of phrase-based models being, in theory, a poor fit for polysynthetic languages; and, more interestingly, that (ii) outperformed several of the submitted systems, highlighting how difficult low-resource machine translation for polysynthetic languages is.
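
To make the second run concrete, here is a minimal sketch of what a character-trigram "random babbling" baseline might look like: a trigram model over characters is estimated on the target-side training text, and output is produced by sampling characters from it, roughly matched to the source length. This is an assumption about the setup, not the authors' exact implementation; the function names (`train_trigram_lm`, `babble`), the padding symbols, and the length heuristic are illustrative only.

```python
import random
from collections import defaultdict

def train_trigram_lm(target_sentences):
    """Collect character-trigram continuations: for each two-character
    history, store every observed next character (frequency-proportional)."""
    counts = defaultdict(list)
    for sent in target_sentences:
        padded = "^^" + sent + "$"          # "^" = start padding, "$" = end marker
        for i in range(len(padded) - 2):
            counts[padded[i:i + 2]].append(padded[i + 2])
    return counts

def babble(counts, max_len):
    """Sample characters from the trigram model until the end marker
    or a maximum length is reached."""
    history, output = "^^", []
    for _ in range(max_len):
        candidates = counts.get(history)
        if not candidates:
            break
        nxt = random.choice(candidates)
        if nxt == "$":
            break
        output.append(nxt)
        history = history[1] + nxt
    return "".join(output)

if __name__ == "__main__":
    # Toy target-side training data (illustrative only).
    train = ["akwé:kon", "onkwehonwe", "kanien'kéha"]
    lm = train_trigram_lm(train)
    # "Translate" a source sentence by babbling a string of similar length.
    source = "hola mundo"
    print(babble(lm, max_len=len(source)))
```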
