Named Entity Recognition for Kannada using Gazetteers list with Conditional Random Fields

Named Entities (NEs) in sentences are essential for building Natural Language Processing (NLP) applications that perform Information Extraction (IE) from large corpora. However, building a large corpus is challenging for resource-poor languages such as Kannada, and no annotated corpus is available online. Annotating NEs with pre-defined classes is difficult for two reasons: NEs are morphologically joined with other words, and spelling variations are frequent in Kannada words. In addition, sentence structure depends on the morphology, parts of speech (POS) and chunking of a language, and these properties differ from one language to another. To address these challenges, a novel system is proposed to identify NEs in Kannada using a large corpus of 73,676 tokens. The Named Entity Recognition (NER) system consists of a robust POS tagger and a Noun Phrase (NP) chunker developed for generic data. Five gazetteer lists were created from many orthographic patterns for each word. Context information such as the previous two words, the next two words, word morphology and gazetteer lists was added to the feature lists. A unigram-bigram template was designed and incorporated into Conditional Random Fields (CRFs) to generate conditional feature functions. The proposed system achieved f-measures of 86.85% and 71.01% on gold test data and newspaper data, respectively.
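The feature design described in the abstract (a window of two words on either side, word morphology, and gazetteer membership) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration, not the authors' implementation: the gazetteer contents, the romanized example tokens, the `<PAD>` marker, and the use of fixed-length prefixes and suffixes as a stand-in for word morphology.

```python
from typing import Dict, List, Set

# Hypothetical gazetteers with romanized entries; the paper's five lists
# are not reproduced here.
GAZETTEERS: Dict[str, Set[str]] = {
    "person": {"raama"},
    "location": {"bengaluru"},
}


def token_features(tokens: List[str], i: int, affix_len: int = 3) -> Dict[str, str]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    feats = {
        "w0": word,
        # Crude morphology proxy (assumption): fixed-length prefix and suffix.
        "prefix": word[:affix_len],
        "suffix": word[-affix_len:],
    }
    # Context window: previous two and next two words, padded at the edges.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        feats[f"w{offset:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
    # One binary feature per gazetteer list (exact match only).
    for name, entries in GAZETTEERS.items():
        feats[f"gaz_{name}"] = "1" if word in entries else "0"
    return feats


def sentence_features(tokens: List[str]) -> List[Dict[str, str]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(tokens, i) for i in range(len(tokens))]
```

Per-token dictionaries of this kind are a common input format for CRF toolkits. Exact gazetteer matching as shown would miss inflected forms, which is one reason the morphological challenges noted in the abstract matter for Kannada.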
