Ship-LemmaTagger: Building an NLP Toolkit for a Peruvian Native Language

Natural Language Processing (NLP) deals with the understanding and generation of text by computer programs. Among the many functions used in this area, a few serve as the foundation for the rest: those that handle the core morphological processing of a language (such as lemmatization) and the automatic identification of part-of-speech tags. This paper describes the implementation of a basic NLP toolkit for a new language, focusing on these two features and testing them on a corpus built for this purpose. The results obtained exceeded expectations and could be used for more complex tasks such as machine translation.
