A shallow parser based on closed-class words to capture relations in biomedical text

Natural language processing for biomedical text currently focuses mostly on entity and relation extraction. The targets are usually pre-specified: particular entity types, e.g., proteins, and particular relation types, e.g., inhibit relations. A shallow parser that automatically captures the relations between noun phrases in free text has been developed and evaluated. It uses heuristics and a noun phraser to capture entities of interest in the text. Cascaded finite state automata then structure the relations between individual entities. The automata are based on closed-class English words and model generic relations that are not limited to specific words. The parser also recognizes coordinating conjunctions and captures negation in text, a feature usually ignored by other systems. Three cancer researchers evaluated 330 relations extracted from 26 abstracts of interest to them. Of these, 296 relations were correctly extracted, giving a precision of 90% (296/330) and an average of 11 correct relations per abstract.
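To make the idea concrete, the following is a minimal sketch of a single finite-state pass that links adjacent noun phrases and flags negation. It is not the authors' parser: the bracket notation for pre-chunked noun phrases, the tiny `NEGATIONS` and `AUXILIARIES` word lists, and the names `Relation` and `extract_relations` are all assumptions made for illustration. Coordinating conjunctions and the cascade of multiple automata are not modeled.

```python
import re
from typing import List, NamedTuple

# Illustrative closed-class word lists (assumed, far from complete).
NEGATIONS = {"not", "no", "never"}
AUXILIARIES = {"do", "does", "did", "is", "are", "was", "were"}

# A token is either a bracketed noun phrase or a single word.
TOKEN = re.compile(r"\[([^\]]+)\]|(\S+)")


class Relation(NamedTuple):
    left: str        # noun phrase on the left
    connector: str   # words linking the two noun phrases
    right: str       # noun phrase on the right
    negated: bool    # True if a negation word occurred in between


def extract_relations(chunked: str) -> List[Relation]:
    """Link each noun phrase to the next one in the same sentence.

    Two states: waiting for a left noun phrase (left is None), or
    collecting connector words until the next noun phrase arrives.
    """
    relations: List[Relation] = []
    left = None
    connector: List[str] = []
    negated = False
    for match in TOKEN.finditer(chunked):
        noun_phrase, word = match.group(1), match.group(2)
        if noun_phrase is not None:
            # Emit a relation only if something connected the two phrases.
            if left is not None and connector:
                relations.append(
                    Relation(left, " ".join(connector), noun_phrase, negated)
                )
            left, connector, negated = noun_phrase, [], False
            continue
        if left is None:
            continue
        token = word.lower().strip(".,;:")
        if token in NEGATIONS:
            negated = True
        elif token and token not in AUXILIARIES:
            connector.append(token)
        # Sentence-final punctuation resets the automaton.
        if word[-1] in ".;":
            left, connector, negated = None, [], False
    return relations


if __name__ == "__main__":
    text = "[p53] does not inhibit [the growth] of [tumor cells]."
    for relation in extract_relations(text):
        print(relation)
```

The open-class word between the noun phrases ("inhibit" here) is passed through unchanged, while the transitions depend only on closed-class words. This is one way to read the claim that the relations are generic rather than tied to specific words.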
