A Two Stage Language Independent Named Entity Recognition for Indian Languages

This paper describes the development of a two-stage hybrid Named Entity Recognition (NER) system for Indian languages, in particular Hindi, Oriya, Bengali and Telugu. The system combines two statistical models, a Maximum Entropy model (MaxEnt) and a Hidden Markov Model (HMM). We use a variety of features and contextual information to predict the various Named Entity (NE) classes. The system uses both language-dependent and language-independent rules. We also attempt to identify nested named entities (NEs) using a set of linguistic rules that are purely language independent. In addition to the rules, we use gazetteer lists for Oriya, Bengali and Hindi to improve accuracy. The system has been trained on Hindi (450,150 tokens), Oriya (150,100 tokens), Bengali (93,023 tokens) and Telugu (50,250 tokens). It has been tested on 35,018 tokens of Hindi, 45,100 tokens of Oriya, 28,123 tokens of Bengali and 4,320 tokens of Telugu.