Paper information - Automatic acquisition and classification of terminology using a tagged corpus in the molecular biology domain

This article describes our work to identify and classify terms in the domain of molecular biology according to examples that have been marked up by a domain expert in a corpus of abstracts taken from a controlled search of the Medline database. Automatic acquisition of biomedical term lists has so far been slow due to high variability in both the terms and their classification scheme, which we attribute to the diversity of research disciplines involved. Nevertheless, the explosive growth in online molecular biology literature makes a persuasive case for automating many tasks. These include the acquisition of records for gene-product databases such as SwissProt, which are currently updated by human experts, a task that is both time-consuming and often highly idiosyncratic. In this article, we report results from a tool based on a hidden Markov model for extracting and classifying terms that can be used as a key component in an information extraction system. We discuss the results in light of the lexical, syntactic and semantic properties of terms that were revealed by our study.

Nigel Collier | Jun'ichi Tsujii | Chikashi Nobata
