A Maximum Entropy Approach to Kannada Part Of Speech Tagging

Part Of Speech (POS) tagging is the most important preprocessing step in almost all Natural Language Processing (NLP) applications. It is the process of labelling each word in a text with its appropriate part of speech. In this paper, a Maximum Entropy model, a probabilistic classifier, is applied to the tagging of Kannada sentences. Kannada is agglutinative and morphologically very rich, but resource poor. Hence, 51,267 words from the EMILLE corpus were manually tagged and used as training data. The tagset comprised 25 tags as defined for Indian languages. The feature set best suited to the language was finalised after rigorous experiments. A test set of 2,892 word forms was downloaded from Kannada websites. An accuracy of 81.6% was obtained in the experiments, which indicates that the Maximum Entropy model is well suited to Kannada.

General Terms: Artificial Intelligence, Natural Language Processing
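The Maximum Entropy tagging approach described in the abstract can be illustrated with a minimal sketch. A Maximum Entropy classifier is equivalent to multinomial logistic regression over binary features, so the example below trains one with plain stochastic gradient ascent. Everything specific here is an assumption made for illustration: the feature set (word form, suffixes of up to three characters, previous word), the function names, the hyperparameters, and the toy tokens, which are made-up strings rather than real Kannada data. It is not the feature set or the implementation used in the paper.

```python
import math
from collections import defaultdict

def extract_features(words, i):
    """Hypothetical feature set: word form, short suffixes, previous word."""
    w = words[i]
    feats = ["bias", "word=" + w]
    # Suffix features are a natural fit for an agglutinative language,
    # but this particular choice is an assumption, not the paper's.
    for k in (1, 2, 3):
        if len(w) > k:
            feats.append("suf%d=%s" % (k, w[-k:]))
    feats.append("prev=" + (words[i - 1] if i > 0 else "<s>"))
    return feats

def predict_proba(weights, tags, feats):
    """p(tag | features) as a softmax over summed feature weights."""
    scores = [sum(weights[(f, t)] for f in feats) for t in tags]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def train(sentences, epochs=50, lr=0.5):
    """Fit weights by stochastic gradient ascent on the log-likelihood."""
    tags = sorted({t for sent in sentences for _, t in sent})
    weights = defaultdict(float)
    data = []
    for sent in sentences:
        words = [w for w, _ in sent]
        for i, (_, gold) in enumerate(sent):
            data.append((extract_features(words, i), gold))
    for _ in range(epochs):
        for feats, gold in data:
            probs = predict_proba(weights, tags, feats)
            for t, p in zip(tags, probs):
                # Gradient: observed feature count minus expected count.
                grad = (1.0 if t == gold else 0.0) - p
                for f in feats:
                    weights[(f, t)] += lr * grad
    return weights, tags

def tag(weights, tags, words):
    """Assign each word its most probable tag, independently per position."""
    result = []
    for i in range(len(words)):
        probs = predict_proba(weights, tags, extract_features(words, i))
        result.append(tags[probs.index(max(probs))])
    return result

def accuracy(predicted, gold):
    """Fraction of positions where the predicted tag matches the gold tag."""
    correct = sum(1 for p, g in zip(predicted, gold) if p == g)
    return correct / len(gold)

# Made-up toy tokens (NOT real Kannada data); tag labels are illustrative.
TOY_TRAIN = [
    [("balanu", "NN"), ("odida", "VM")],
    [("malanu", "NN"), ("hadida", "VM")],
    [("kalanu", "NN"), ("nodida", "VM")],
]

if __name__ == "__main__":
    weights, tags = train(TOY_TRAIN)
    predicted = tag(weights, tags, ["salanu", "kudida"])
    print(predicted, accuracy(predicted, ["NN", "VM"]))
```

The sketch tags each word independently from its local context. A practical tagger would typically also condition on previously predicted tags and search over tag sequences, and would add regularisation when trained on a corpus of realistic size.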
