Recognizing noun phrases in biomedical text: An evaluation of lab prototypes and commercial chunkers

In the biomedical domain, many systems for text mining and information extraction rely on basic morphological and syntactic analysis such as part-of-speech tagging or noun phrase (NP) chunking. Due to the lack of sufficient in-domain resources, these systems often make use of NLP tools trained and evaluated on newspaper-language training sets. Scientific texts in the life sciences, however, differ from general language in the structure and complexity of noun phrases. Therefore, we tested the effects this domain change has on the performance of these systems. For this purpose, we compared three prototype chunking systems developed in research labs (all based on statistical learning methods) and one chunking system that is part of a commercial information extraction toolkit (based on manually supplied grammar specifications). We trained these systems on PENN TREEBANK tagging and chunking annotations for newspapers and ran them on the GENIA treebank, which contains such annotations for biological abstracts taken from MEDLINE. First, we observed a significant overall loss in performance (on the order of 4%); second, we found (with the exception of the SVM-based system) no significant difference between the performance of the lab prototypes and the commercial chunker on GENIA data. Fortunately, the performance loss can also be partly remedied by a few biomedical domain-specific adapta-
