Nested Named Entity Recognition

Many named entities contain other named entities inside them. Despite this fact, the field of named entity recognition has almost entirely ignored nested named entity recognition, but due to technological, rather than ideological reasons. In this paper, we present a new technique for recognizing nested named entities, by using a discriminative constituency parser. To train the model, we transform each sentence into a tree, with constituents for each named entity (and no other syntactic structure). We present results on both newspaper and biomedical corpora which contain nested named entities. In three out of four sets of experiments, our model outperforms a standard semi-CRF on the more traditional top-level entities. At the same time, we improve the overall F-score by up to 30% over the flat model, which is unable to recover any nested entities.

[1]  Kate Byrne Nested Named Entity Recognition in Historical Archive Text , 2007, International Conference on Semantic Computing (ICSC 2007).

[2]  Jian Su,et al.  Recognizing Names in Biomedical Texts: a Machine Learning Approach , 2004 .

[3]  Alexander Clark,et al.  Combining Distributional and Morphological Information for Part of Speech Induction , 2003, EACL.

[4]  William W. Cohen,et al.  Semi-Markov Conditional Random Fields for Information Extraction , 2004, NIPS.

[5]  Mihai Surdeanu,et al.  UPC: Experiments with Joint Learning within SemEval Task 9 , 2007, SemEval@ACL.

[6]  Mariona Taulé,et al.  DRAFT VERSION 1 AnCora : Multilingual and Multilevel Annotated Corpora , 2007 .

[7]  Jian Su,et al.  Effective Adaptation of Hidden Markov Model-based Named Entity Recognizer for Biomedical Domain , 2003, BioNLP@ACL.

[8]  Jian Su,et al.  Exploring Deep Knowledge Resources in Biomedical Name Recognition , 2004, NLPBA/BioNLP.

[9]  Erik F. Tjong Kim Sang,et al.  Introduction to the CoNLL-2003 Shared Task: Language-Independent Named Entity Recognition , 2003, CoNLL.

[10]  Jin-Dong Kim,et al.  The GENIA corpus: an annotated research abstract corpus in molecular biology domain , 2002 .

[11]  Sophia Ananiadou,et al.  Developing a Robust Part-of-Speech Tagger for Biomedical Text , 2005, Panhellenic Conference on Informatics.

[12]  Jian Su,et al.  Enhancing HMM-based biomedical named entity recognition by studying special phenomena , 2004, J. Biomed. Informatics.

[13]  Galen Andrew,et al.  A Hybrid Markov/Semi-Markov Conditional Random Field for Sequence Segmentation , 2006, EMNLP.

[14]  Andrew McCallum,et al.  Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data , 2001, ICML.

[15]  Mariona Taulé,et al.  SemEval-2007 Task 09: Multilevel Semantic Annotation of Catalan and Spanish , 2007, *SEMEVAL.

[16]  Lluís Màrquez i Villodre,et al.  SemEval-2007 Task 09: Multilevel Semantic Annotation of Catalan and Spanish , 2007, SemEval@ACL.

[17]  Malvina Nissim,et al.  Exploring the boundaries: gene and protein identification in biomedical text , 2005, BMC Bioinformatics.

[18]  Ralph Grishman,et al.  A Maximum Entropy Approach to Named Entity Recognition , 1999 .

[19]  Baohua Gu Recognizing Nested Named Entities in GENIA corpus , 2006, BioNLP@NAACL-HLT.

[20]  James R. Curran,et al.  Language Independent NER using a Maximum Entropy Tagger , 2003, CoNLL.

[21]  Koby Crammer,et al.  Flexible Text Segmentation with Structured Multilabel Classification , 2005, HLT.

[22]  Beatrice Alex,et al.  Recognising Nested Named Entities in Biomedical Text , 2007, BioNLP@ACL.

[23]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[24]  Jun'ichi Tsujii,et al.  Tuning support vector machines for biomedical named entity recognition , 2002, ACL Workshop on Natural Language Processing in the Biomedical Domain.

[25]  Jun'ichi Tsujii,et al.  Boosting Precision and Recall of Dictionary-Based Protein Name Recognition , 2003, BioNLP@ACL.

[26]  G. D. Zhou,et al.  Recognizing names in biomedical texts using mutual information independence model and SVM plus sigmoid , 2006, Int. J. Medical Informatics.

[27]  Su Jian,et al.  Exploring Deep Knowledge Resources in Biomedical Name Recognition , 2004, NLPBA/BioNLP.

[28]  Christopher D. Manning,et al.  Efficient, Feature-based, Conditional Random Field Parsing , 2008, ACL.