Challenges of language technologies for the indigenous languages of the Americas

The indigenous languages of the Americas are highly diverse, yet they have received little attention from a technological perspective. In this paper, we review the research, digital resources, and available NLP systems that focus on these languages. We present the main challenges and research questions that arise when dealing with distant languages and low-resource scenarios. We hope to encourage NLP research in linguistically rich and diverse areas such as the Americas.
