Maximum Entropy Approach for Named Entity Recognition in Bengali and Hindi

This paper reports on the development of a Named Entity Recognition (NER) system for two leading Indian languages, Bengali and Hindi, using the Maximum Entropy (ME) framework. We use the annotated corpora from the IJCNLP-08 NER Shared Task on South and South East Asian Languages (NERSSEAL), which are tagged with a fine-grained Named Entity (NE) tagset of twelve tags. A tag conversion routine maps these corpora to a form tagged with four NE tags: Person name, Location name, Organization name, and Miscellaneous name. The system uses contextual information about the words together with a variety of orthographic word-level features that help predict the four NE classes. We consider both language-independent features, which apply to both languages, and language-specific features of Bengali and Hindi. Evaluation results show that the use of linguistic features can improve the performance of the system. In 10-fold cross-validation tests, the system achieves overall average recall, precision, and F-score values of 88.01%, 82.63%, and 85.22%, respectively, for Bengali and 86.4%, 79.23%, and 82.66%, respectively, for Hindi.
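As an illustration of the kind of contextual and orthographic word-level features described above, the following is a minimal sketch and not the system's actual feature set. The function name `word_features`, the feature keys, the window size of two, and the affix length of three are all assumptions made for this example. Affixes are taken over Unicode code points, which may split a base character from its dependent vowel sign in Bengali or Hindi script.

```python
from typing import Dict, List


def word_features(tokens: List[str], i: int,
                  window: int = 2, max_affix: int = 3) -> Dict[str, object]:
    """Build a feature dictionary for tokens[i].

    All feature names and default sizes are hypothetical choices for
    illustration; they are not taken from the paper.
    """
    word = tokens[i]
    feats: Dict[str, object] = {"w0": word}

    # Contextual features: neighbouring words inside the window,
    # padded with a placeholder at the sentence boundaries.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"w{off:+d}"] = tokens[j] if 0 <= j < len(tokens) else "<PAD>"

    # Orthographic features: prefixes and suffixes up to max_affix
    # code points (language independent, no lexicon required).
    for n in range(1, max_affix + 1):
        if len(word) >= n:
            feats[f"pre{n}"] = word[:n]
            feats[f"suf{n}"] = word[-n:]

    # Simple word-level flags.
    feats["has_digit"] = any(ch.isdigit() for ch in word)
    feats["first_word"] = (i == 0)
    return feats


if __name__ == "__main__":
    sentence = ["Sachin", "lives", "in", "Mumbai"]
    print(word_features(sentence, 3))
```

Dictionaries of this shape can be passed to any maximum entropy (multinomial logistic regression) trainer that accepts sparse named features; which trainer to use is left open here.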