Towards Language Technology for Mi’kmaq

Mi’kmaq is a polysynthetic Indigenous language spoken primarily in Eastern Canada, on which no prior computational work has focused. In this paper we first construct and analyze a web corpus of Mi’kmaq. We then evaluate several approaches to language modelling for Mi’kmaq, including character-level models, which are particularly well suited to morphologically rich languages. Preservation of Indigenous languages is particularly important in the current Canadian context; we argue that natural language processing could aid such efforts.
