Moses and the Character-Based Random Babbling Baseline: CoAStaL at AmericasNLP 2021 Shared Task

We evaluated a range of neural machine translation techniques developed specifically for low-resource scenarios. Unsuccessfully. In the end, we submitted two runs: (i) a standard phrase-based model, and (ii) a random babbling baseline using character trigrams. We found that (i) was surprisingly hard to beat, in spite of phrase-based models being, in theory, a poor fit for polysynthetic languages; and, more interestingly, that (ii) outperformed several of the submitted systems, highlighting how difficult low-resource machine translation for polysynthetic languages is.
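
To make the second run concrete, here is a minimal sketch of what a character-trigram "random babbling" baseline might look like: a trigram model over characters is estimated on the target-side training text, and output is produced by sampling characters from it, roughly matched to the source length. This is an assumption about the setup, not the authors' exact implementation; the function names (`train_trigram_lm`, `babble`), the padding symbols, and the length heuristic are illustrative only.

```python
import random
from collections import defaultdict

def train_trigram_lm(target_sentences):
    """Collect character-trigram continuations: for each two-character
    history, store every observed next character (frequency-proportional)."""
    counts = defaultdict(list)
    for sent in target_sentences:
        padded = "^^" + sent + "$"          # "^" = start padding, "$" = end marker
        for i in range(len(padded) - 2):
            counts[padded[i:i + 2]].append(padded[i + 2])
    return counts

def babble(counts, max_len):
    """Sample characters from the trigram model until the end marker
    or a maximum length is reached."""
    history, output = "^^", []
    for _ in range(max_len):
        candidates = counts.get(history)
        if not candidates:
            break
        nxt = random.choice(candidates)
        if nxt == "$":
            break
        output.append(nxt)
        history = history[1] + nxt
    return "".join(output)

if __name__ == "__main__":
    # Toy target-side training data (illustrative only).
    train = ["akwé:kon", "onkwehonwe", "kanien'kéha"]
    lm = train_trigram_lm(train)
    # "Translate" a source sentence by babbling a string of similar length.
    source = "hola mundo"
    print(babble(lm, max_len=len(source)))
```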
