Tamil NER - Coping with Real Time Challenges

This paper describes various challenges encountered while developing an automatic Named Entity Recognition (NER) system for Tamil using Conditional Random Fields (CRFs). We also discuss how we have overcome some of these challenges. Though most of the NER challenges discussed here are common to many Indian languages, the focus of this work is on Tamil, a South Indian language belonging to the Dravidian language family. The corpus used in this work is web data consisting of newspaper articles, articles on blog sites, and content from other online web portals.
