The first named entity recognizer in Maithili: Resource creation and system development

In this paper, we present our work on developing a Maithili Named Entity Recognition (NER) system. Maithili is one of the official languages of India, with around 50 million native speakers. Although NER systems have been developed for several Indian languages, we did not find any openly available NER resource or system for Maithili. For this work, we manually annotated a Maithili NER corpus containing around 200K words. We prepared a baseline classifier using Conditional Random Fields (CRF) and then ran a series of experiments with various recurrent neural network (RNN) models. We also collected a larger raw corpus to obtain better word and character embeddings. In our experiments, we found that neural models outperform the CRF baseline, that a CRF layer is effective for predicting the final output of the RNN models, and that character embeddings are effective for Maithili. We also investigated the effectiveness of gazetteer lists in neural models: we prepared a few gazetteer lists from various web resources and incorporated them into the neural models through a gazetteer layer, which improved performance. The final system achieved an F-measure of 91.6% with 94.9% precision and 88.53% recall.
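As a quick consistency check on the reported scores, the sketch below recomputes the F-measure from the stated precision and recall. It assumes the reported F-measure is the balanced F1 score (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_measure` and the `beta` parameter are illustrative, not taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Return the F-beta score; beta=1 gives the harmonic mean (F1).

    Assumption: the abstract's "f-measure" is the balanced F1 score.
    Inputs may be fractions or percentages, as long as both use the same scale.
    """
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract (in percent).
reported_precision = 94.9
reported_recall = 88.53

recomputed_f = f_measure(reported_precision, reported_recall)
print(round(recomputed_f, 1))  # 91.6
```

Under that assumption, the recomputed value agrees with the reported 91.6% to one decimal place.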
