Identification of key concepts in biomedical literature using a modified Markov heuristic

MOTIVATION: The recent explosion of interest in mining the biomedical literature for associations between defined entities such as genes, diseases and drugs has made apparent the need for robust methods of identifying occurrences of these entities in biomedical text. Such concept-based indexing depends strongly on the availability of a comprehensive ontology or lexicon of biomedical terms. However, such ontologies are difficult and expensive to construct, and often require extensive manual curation before automatic indexing programs can use them. Using statistically salient noun phrases as surrogates for curated terminology has its own difficulties, because high-quality part-of-speech taggers specific to medical nomenclature are lacking.

RESULTS: We describe a method of improving the quality of automatically extracted noun phrases by employing prior knowledge during the hidden Markov model (HMM) training procedure for the tagger. This enhancement, when combined with appropriate training data, can greatly improve the quality and relevance of the extracted phrases, thereby enabling greater accuracy in downstream literature mining tasks.
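The abstract does not say how prior knowledge enters HMM training, so the following is only a minimal sketch of one common way to do it, not the authors' procedure. It assumes a supervised, count-based bigram HMM in which a domain lexicon contributes pseudo-counts to the emission table before decoding with Viterbi. All names (`train_hmm`, `lexicon_prior`, `prior_weight`, the toy tags and sentences) are hypothetical.

```python
import math
from collections import defaultdict

START = "<s>"  # hypothetical sentence-start state

def train_hmm(tagged_sentences, lexicon_prior, prior_weight=5.0, smoothing=1e-3):
    """Count-based HMM estimation; lexicon entries act as emission pseudo-counts."""
    trans = defaultdict(lambda: defaultdict(float))
    emit = defaultdict(lambda: defaultdict(float))
    tags, vocab = set(), set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            word = word.lower()
            trans[prev][tag] += 1.0
            emit[tag][word] += 1.0
            tags.add(tag)
            vocab.add(word)
            prev = tag
    # Assumed form of "prior knowledge": each (word, tag) lexicon entry is
    # treated as if it had been observed prior_weight times in training.
    for word, tag in lexicon_prior.items():
        word = word.lower()
        emit[tag][word] += prior_weight
        tags.add(tag)
        vocab.add(word)
    return {"trans": trans, "emit": emit, "tags": sorted(tags),
            "vocab": vocab, "smoothing": smoothing}

def log_prob(table, given, outcome, n_outcomes, smoothing):
    """Additively smoothed log P(outcome | given)."""
    row = table.get(given, {})
    total = sum(row.values())
    return math.log((row.get(outcome, 0.0) + smoothing)
                    / (total + smoothing * n_outcomes))

def viterbi(words, model):
    """Return the most probable tag sequence for a non-empty list of words."""
    tags, s = model["tags"], model["smoothing"]
    n_tags = len(tags)
    n_words = len(model["vocab"]) + 1  # one extra slot for unknown words
    words = [w.lower() for w in words]
    best = [{t: log_prob(model["trans"], START, t, n_tags, s)
                + log_prob(model["emit"], t, words[0], n_words, s)
             for t in tags}]
    back = [{t: None for t in tags}]
    for i in range(1, len(words)):
        scores, pointers = {}, {}
        for t in tags:
            e = log_prob(model["emit"], t, words[i], n_words, s)
            prev = max(tags, key=lambda p: best[i - 1][p]
                       + log_prob(model["trans"], p, t, n_tags, s))
            scores[t] = (best[i - 1][prev]
                         + log_prob(model["trans"], prev, t, n_tags, s) + e)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    tag = max(tags, key=lambda t: best[-1][t])
    path = [tag]
    for i in range(len(words) - 1, 0, -1):
        tag = back[i][tag]
        path.append(tag)
    return path[::-1]

# Toy data, invented for illustration only.
corpus = [[("the", "DT"), ("protein", "NN"), ("binds", "VB")],
          [("the", "DT"), ("gene", "NN"), ("activates", "VB")]]
lexicon = {"p53": "NN", "kinase": "NN"}
model = train_hmm(corpus, lexicon)
print(viterbi(["the", "p53", "binds"], model))
```

In this sketch, `prior_weight` controls how strongly the lexicon overrides sparse corpus evidence. With a small general-purpose training corpus, a larger weight lets domain terms that never appear in the training text still be tagged as nouns, which is what noun phrase extraction would then rely on.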
