Named entity recognition with multiple segment representations

Named entity recognition (NER) is mostly formalized as a sequence labeling problem in which segments of named entities are represented by label sequences. Although a considerable effort has been made to investigate sophisticated features that encode textual characteristics of named entities (e.g. PEOPLE, LOCATION, etc.), little attention has been paid to segment representations (SRs) for multi-token named entities (e.g. the IOB2 notation). In this paper, we investigate the effects of different SRs on NER tasks, and propose a feature generation method using multiple SRs. The proposed method allows a model to exploit not only highly discriminative features of complex SRs but also robust features of simple SRs against the data sparseness problem. Since it incorporates different SRs as feature functions of Conditional Random Fields (CRFs), we can use the well-established procedure for training. In addition, the tagging speed of a model integrating multiple SRs can be accelerated equivalent to that of a model using only the most complex SR of the integrated model. Experimental results demonstrate that incorporating multiple SRs into a single model improves the performance and the stability of NER. We also provide the detailed analysis of the results.

[1]  Jun'ichi Tsujii,et al.  Bidirectional Inference with the Easiest-First Strategy for Tagging Sequence Data , 2005, HLT.

[2]  Chun-Nan Hsu,et al.  Integrating high dimensional bi-directional parsing models for gene mention tagging , 2008, ISMB.

[3]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[4]  Burr Settles,et al.  Biomedical Named Entity Recognition using Conditional Random Fields and Rich Feature Sets , 2004, NLPBA/BioNLP.

[5]  Hai Zhao,et al.  Effective Tag Set Selection in Chinese Word Segmentation via Conditional Random Field Modeling , 2006, PACLIC.

[6]  Ben Taskar,et al.  An Introduction to Conditional Random Fields for Relational Learning , 2007 .

[7]  Hae-Chang Rim,et al.  Biomedical named entity recognition using two-phase model based on SVMs , 2004, J. Biomed. Informatics.

[8]  Richard Tzong-Han Tsai,et al.  Overview of BioCreative II gene mention recognition , 2008, Genome Biology.

[9]  Graciela Gonzalez,et al.  BANNER: An Executable Survey of Advances in Biomedical Named Entity Recognition , 2007, Pacific Symposium on Biocomputing.

[10]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[11]  Erik F. Tjong Kim Sang,et al.  Representing Text Chunks , 1999, EACL.

[12]  Nianwen Xu,et al.  Chinese Word Segmentation as Character Tagging , 2003, Int. J. Comput. Linguistics Chin. Lang. Process..

[13]  Kentaro Torisawa,et al.  Exploiting Wikipedia as External Knowledge for Named Entity Recognition , 2007, EMNLP.

[14]  Satoshi Sekine,et al.  A survey of named entity recognition and classification , 2007 .

[15]  Olivier Bodenreider,et al.  The Unified Medical Language System (UMLS): integrating biomedical terminology , 2004, Nucleic Acids Res..

[16]  T. Tatusova,et al.  Entrez Gene: gene-centered information at NCBI , 2006, Nucleic Acids Res..

[17]  Angie Williams,et al.  Introduction To The Colloquy , 2003 .

[18]  Wei Li,et al.  Early results for Named Entity Recognition with Conditional Random Fields, Feature Induction and Web-Enhanced Lexicons , 2003, CoNLL.

[19]  Dan Roth,et al.  Design Challenges and Misconceptions in Named Entity Recognition , 2009, CoNLL.

[20]  Hongfei Lin,et al.  Incorporating rich background knowledge for gene named entity classification and recognition , 2009, BMC Bioinformatics.

[21]  Yuji Matsumoto,et al.  Chunking with Support Vector Machines , 2001, NAACL.

[22]  Nanda Kambhatla Minority Vote: At-Least-N Voting Improves Recall for Extracting Relations , 2006, ACL.