Paper Information - Towards Language Technology for Mi'kmaq

Towards Language Technology for Mi'kmaq

Mi’kmaq is a polysynthetic Indigenous language spoken primarily in Eastern Canada, on which no prior computational work has focused. In this paper we first construct and analyze a web corpus of Mi’kmaq. We then evaluate several approaches to language modelling for Mi’kmaq, including character-level models that are particularly well suited to morphologically rich languages. Preservation of Indigenous languages is particularly important in the current Canadian context; we argue that natural language processing could aid such efforts.

Paul Cook | Anant Maheshwari | Léo Bouscarrat

[1] Graham Neubig, et al. Cross-Lingual Word Embeddings for Low-Resource Language Modeling, 2017, EACL.

[2] Ralf D. Brown, et al. Non-linear Mapping for Improved Identification of 1300+ Languages, 2014, EMNLP.

[3] Silas Tertius Rand. Dictionary of the language of the Micmac Indians: who reside in Nova Scotia, New Brunswick, Prince Edward Island, Cape Breton and Newfoundland, 2007.

[4] Asma-na-hi Antoine, et al. Appendix A: Excerpts from Honouring the Truth, Reconciling for the Future: Summary of the Final Report of the Truth and Reconciliation Commission of Canada, 2018.

[5] Adam Kilgarriff, et al. Getting to Know Your Corpus, 2012, TSD.

[6] Timothy Baldwin, et al. langid.py: An Off-the-shelf Language Identification Tool, 2012, ACL.

[7] Marie Battiste, et al. Micmac Literacy and Cognitive Assimilation, 1984.

[8] Alexander M. Rush, et al. Character-Aware Neural Language Models, 2015, AAAI.

[9] Wonyong Sung, et al. Character-level language modeling with hierarchical recurrent neural networks, 2016, 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

[10] Silvia Bernardini, et al. Introducing and evaluating ukWaC, a very large web-derived corpus of English, 2008.
