Aggregating Machine Learning and Rule Based Heuristics for Named Entity Recognition

This paper, submitted as an entry for the NERSSEAL-2008 shared task, describes a system built for Named Entity Recognition for South and South East Asian Languages. Our system combines machine learning techniques with language-specific heuristics to model the problem of NER for Indian languages. The system has been tested on five languages: Telugu, Hindi, Bengali, Urdu and Oriya. It uses CRF (Conditional Random Fields) based machine learning, followed by post-processing that applies heuristics or rules. The system is specifically tuned for Hindi and Telugu; we also report results for the other three languages.
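
The two-stage design described above (statistical tagging followed by rule-based correction) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tag names, the honorific list, the `honorific_rule` and `apply_rules` helpers, and the hard-coded `crf_tags`, which stand in for the output of a trained CRF tagger. It is not the rule set or the implementation used in the system.

```python
from typing import Callable, List

# A rule inspects the sentence and the current tags, and returns the
# (possibly corrected) tag for position i.
Rule = Callable[[List[str], List[str], int], str]

# Hypothetical honorific list, for illustration only.
HONORIFICS = {"Shri", "Dr."}


def honorific_rule(tokens: List[str], tags: List[str], i: int) -> str:
    """Hypothetical heuristic: an untagged token that directly follows
    an honorific is relabelled as a person name."""
    if i > 0 and tokens[i - 1] in HONORIFICS and tags[i] == "O":
        return "NEP"  # assumed label for a person entity
    return tags[i]


def apply_rules(tokens: List[str], crf_tags: List[str], rules: List[Rule]) -> List[str]:
    """Post-process tagger output: apply each rule, in order, at every position."""
    tags = list(crf_tags)  # copy so the tagger output is left unchanged
    for i in range(len(tokens)):
        for rule in rules:
            tags[i] = rule(tokens, tags, i)
    return tags


# Stand-in for the first stage: these tags are hard-coded here, whereas
# in a real pipeline they would come from a trained CRF model.
tokens = ["Shri", "Ramesh", "visited", "Delhi"]
crf_tags = ["O", "O", "O", "NEL"]

final_tags = apply_rules(tokens, crf_tags, [honorific_rule])
print(list(zip(tokens, final_tags)))
```

In this sketch the rules only override tags the statistical stage left as `O`, so the heuristics add recall without undoing the learned model's decisions; whether the actual system follows that policy is not stated in the abstract.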
