Exploiting Domain Structure for Named Entity Recognition

Named Entity Recognition (NER) is a fundamental task in text mining and natural language understanding. Current approaches to NER, mostly based on supervised learning, perform well on domains similar to the training domain but tend to adapt poorly to even slightly different domains. We present several strategies for exploiting the domain structure in the training data to learn a more robust named entity recognizer that performs well on a new domain. First, we propose a simple yet effective way to automatically rank features by their generalizability across domains. We then train a classifier that places strong emphasis on the most generalizable features; this emphasis is imposed through a rank-based prior on a logistic regression model. We further propose a domain-aware cross-validation strategy to help choose an appropriate parameter for the rank-based prior. We evaluate the proposed method on the task of recognizing named entities (genes) in biology text involving three species. The experimental results show that the new domain-aware approach outperforms a state-of-the-art baseline method in adapting to new domains, especially when the new domain differs greatly from the training domain.
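
As a rough illustration of how the three pieces described above (feature ranking, a rank-based prior, and domain-aware cross-validation) might fit together, the following Python sketch makes several assumptions that are not stated in the text. The ranking criterion (a feature's worst per-domain score), the prior form (a Gaussian prior with variance a / r^b for the feature at rank r), the accuracy metric, and all function, feature, and domain names are hypothetical choices made for illustration, not the method as published.

```python
import math

def combine_ranks(per_domain_scores):
    """Rank features shared by all domains; rank 1 = most generalizable.
    Assumed criterion: a feature is only as good as its weakest domain."""
    shared = set.intersection(*(set(s) for s in per_domain_scores.values()))
    worst = {f: min(s[f] for s in per_domain_scores.values()) for f in shared}
    ordered = sorted(worst, key=lambda f: (-worst[f], f))
    return {f: r for r, f in enumerate(ordered, start=1)}

def prior_variances(ranks, a=1.0, b=1.0):
    """Assumed rank-based prior: variance a / r**b, so low-ranked
    features are pulled harder toward zero."""
    return {f: a / (r ** b) for f, r in ranks.items()}

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train(examples, variances, steps=200, lr=0.1):
    """MAP logistic regression by batch gradient ascent.
    Each example is (set_of_active_features, label in {0, 1})."""
    w = {f: 0.0 for f in variances}
    for _ in range(steps):
        # Gradient of the Gaussian log-prior, one variance per feature.
        grad = {f: -w[f] / variances[f] for f in w}
        for feats, y in examples:
            p = sigmoid(sum(w[f] for f in feats if f in w))
            for f in feats:
                if f in w:
                    grad[f] += y - p
        for f in w:
            w[f] += lr * grad[f] / len(examples)
    return w

def accuracy(w, examples):
    hits = sum(
        (sigmoid(sum(w.get(f, 0.0) for f in feats)) >= 0.5) == bool(y)
        for feats, y in examples
    )
    return hits / len(examples)

def choose_b(data_by_domain, ranks, candidates):
    """Domain-aware validation: hold out one whole domain per fold,
    so each fold mimics adapting to an unseen domain."""
    def score(b):
        var = prior_variances(ranks, b=b)
        total = 0.0
        for held_out in data_by_domain:
            train_set = [ex for d, exs in data_by_domain.items()
                         if d != held_out for ex in exs]
            total += accuracy(train(train_set, var), data_by_domain[held_out])
        return total / len(data_by_domain)
    return max(candidates, key=score)

# Toy usage with made-up feature scores and two hypothetical domains.
scores = {
    "domain_a": {"suffix=ase": 0.9, "word=wingless": 0.8, "is_capitalized": 0.5},
    "domain_b": {"suffix=ase": 0.7, "word=wingless": 0.0, "is_capitalized": 0.6},
}
ranks = combine_ranks(scores)
data = {
    "domain_a": [({"suffix=ase"}, 1), ({"is_capitalized"}, 0),
                 ({"word=wingless"}, 1)],
    "domain_b": [({"suffix=ase", "is_capitalized"}, 1), ({"is_capitalized"}, 0)],
}
best_b = choose_b(data, ranks, candidates=[0.5, 1.0, 2.0])
weights = train(data["domain_a"] + data["domain_b"],
                prior_variances(ranks, b=best_b))
```

For brevity the sketch computes the feature ranks once from all domains; a stricter version would recompute them inside each fold from the training domains only, so the held-out domain never influences the ranking.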
