raiden11@IECSIL-FIRE-2018 : Named Entity Recognition for Indian Languages

This paper presents our solution to the Named Entity Recognition (NER) task of the Information Extractor for Conversational Systems in Indian Languages (IECSIL) challenge [5] at the FIRE 2018 conference. A subtask of Information Extraction (IE), NER is key to extracting information and semantics from unstructured text. The objective of NER is to identify every word or token in a document and classify it into predefined categories such as person names, locations, and organizations. For this challenge, the dataset provided by IECSIL [4] comprises multilingual text in several Indian languages, including Hindi, Tamil, Malayalam, Telugu, and Kannada. We focus on identifying and classifying named entities belonging to nine categories, such as Name, Location, and Datenum. We tried linear models (Naive Bayes and SVM) as well as a simple neural network. The best results were achieved by the simple neural network, with an accuracy of 90.33% for all languages combined. This suggests that more advanced neural networks could further improve this accuracy.

[1] Geoff Holmes, et al. Multinomial Naive Bayes for Text Categorization Revisited, 2004, Australian Conference on Artificial Intelligence.

[2] Pabitra Mitra, et al. A Hybrid Approach for Named Entity Recognition in Indian Languages, 2008.

[3] Diego Mollá Aliod, et al. Named Entity Recognition for Question Answering, 2006, ALTA.

[4] Tomas Mikolov, et al. Bag of Tricks for Efficient Text Classification, 2016, EACL.

[5] Dipti Misra Sharma, et al. Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition, 2008, IJCNLP.

[6] Ralph Grishman, et al. A Maximum Entropy Approach to Named Entity Recognition, 1999.

[7] Bogdan Babych, et al. Improving Machine Translation Quality with Automatic Named Entity Recognition, 2003, Proceedings of the 7th International EAMT workshop on MT and other Language Technology Tools, Improving MT through other Language Technology Tools Resources and Tools for Building MT - EAMT '03.

[8] Andreas Salomonsson, et al. Entity-based information retrieval, 2012.

[9] Malarkodi C. S., et al. Tamil NER - Coping with Real Time Challenges, 2012.

[10] Ji-Hwan Kim, et al. A rule-based named entity recognition system for speech input, 2000, INTERSPEECH.

[11] K. P. Soman, et al. Overview of Arnekt IECSIL at FIRE-2018 Track on Information Extraction for Conversational Systems in Indian Languages, 2018, FIRE.

[12] Prakhar Gupta, et al. Learning Word Vectors for 157 Languages, 2018, LREC.

[13] S. V. Sathyanarayana, et al. Kannada Named Entity Recognition and Classification (NERC) Based on Multinomial Naïve Bayes (MNB) Classifier, 2015, ArXiv.

[14] S. Amarappa, et al. Named Entity Recognition and Classification in Kannada Language, 2013.

[15] Pabitra Mitra, et al. A Hybrid Feature Set based Maximum Entropy Hindi Named Entity Recognition, 2008, IJCNLP.

[16] Satoshi Sekine, et al. A survey of named entity recognition and classification, 2007.

[17] Amandeep Kaur, et al. Named entity recognition for Punjabi language, 2016.

[18] Sivaji Bandyopadhyay, et al. Named Entity Recognition in Bengali: A Conditional Random Field Approach, 2008, IJCNLP.

[19] K. P. Soman, et al. Information Extraction for Conversational Systems in Indian Languages - Arnekt IECSIL, 2018, FIRE.

[20] Rob Malouf, et al. Markov Models for Language-independent Named Entity Recognition, 2002, CoNLL.

[21] Anitha S. Pillai, et al. Named Entity Recognition for Indian Languages: A Survey, 2013.

[22] Wei Shi, et al. Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification, 2016, ACL.