Peru is Multilingual, Its Machine Translation Should Be Too?

Peru is a multilingual country with a long history of contact between its indigenous languages and Spanish. Machine translation can take advantage of this context through multilingual approaches that learn both unsupervised subword segmentation and neural machine translation models. This study proposes the first multilingual translation models for four languages spoken in Peru: Aymara, Ashaninka, Quechua and Shipibo-Konibo. We provide both many-to-Spanish and Spanish-to-many models, which outperform pairwise baselines in most cases. To train them, we exploit a large English-Spanish dataset for pre-training, monolingual texts with tagged back-translation, and parallel corpora aligned with English. Finally, by fine-tuning the best models, we assess their out-of-domain capabilities on two evaluation datasets for Quechua and a new one for Shipibo-Konibo.
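
As a rough illustration of how tagged back-translation can be combined with a Spanish-to-many model, the sketch below prepends marker tokens to source sentences. It is a minimal sketch under stated assumptions, not the study's actual pipeline: the tag strings (`<BT>`, `<2xx>`), the function names, and the use of a target-language token are all hypothetical choices made for this example, and the sentences are placeholders.

```python
from typing import Tuple

# Hypothetical marker for source sentences produced by back-translation.
BT_TAG = "<BT>"


def lang_token(lang: str) -> str:
    """Build a target-language token such as '<2shp>' (illustrative format)."""
    return f"<2{lang}>"


def make_example(src: str, tgt: str, tgt_lang: str,
                 back_translated: bool = False) -> Tuple[str, str]:
    """Return a (source, target) training pair with tags prepended to the source.

    The target-language token tells a one-to-many model which language to
    produce; the BT tag marks synthetic sources so the model can tell them
    apart from genuine parallel data.
    """
    prefix = [lang_token(tgt_lang)]
    if back_translated:
        prefix.append(BT_TAG)
    return " ".join(prefix + [src]), tgt


# Placeholder sentences; real data would be Spanish sources and
# indigenous-language targets.
genuine = make_example("SPANISH SENTENCE", "TARGET SENTENCE", "shp")
synthetic = make_example("SYNTHETIC SPANISH", "MONOLINGUAL TARGET", "shp",
                         back_translated=True)
print(genuine[0])
print(synthetic[0])
```

In practice, such tags would also need to be registered as reserved symbols in the subword vocabulary so that the segmenter does not split them.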
