Distributed modules for text annotation and IE applied to the biomedical domain

Biological databases contain facts from scientific literature, which have been curated by hand to ensure high quality. Curation is time-consuming and can be supported by information extraction methods. We present a server which identifies biological facts in scientific text and presents the annotation to the curator. Such facts are: UniProt, UMLS and GO terminology, identification of gene and protein names, mutations and protein-protein interactions. UniProt, UMLS and GO concepts are automatically linked to the original source. The module for mutations is based on syntax patterns and the one for protein-protein interactions on NLP. All modules work independently of each other in single threads and are combined in a pipeline to ensure proper meta data integration. For fast response time the modules are distributed on a Linux cluster. The server is at present available to curation teams of biomedical data and will be opened to the public in the future.

[1]  D. Rebholz-Schuhmann,et al.  Computer-assisted generation of a protein-interaction database for nuclear receptors. , 2003, Molecular endocrinology.

[2]  Michael Krauthammer,et al.  GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data , 2004, J. Biomed. Informatics.

[3]  Peter Willett,et al.  Protein Structures and Information Extraction from Biological Texts: The PASTA System , 2003, Bioinform..

[4]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from full texts , 2004, Bioinform..

[5]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[6]  Jeffrey D. Ullman,et al.  Introduction to Automata Theory, Languages and Computation , 1979 .

[7]  Alexander A. Morgan,et al.  Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup , 2003, ISMB.

[8]  Miguel A. Andrade-Navarro,et al.  Information extraction from full text scientific articles: Where are the keywords? , 2003, BMC Bioinformatics.

[9]  Daniel Hanisch,et al.  Playing Biology's Name Game: Identifying Protein Names in Scientific Text , 2002, Pacific Symposium on Biocomputing.

[10]  Russ B. Altman,et al.  Research Paper: Creating an Online Dictionary of Abbreviations from MEDLINE , 2002, J. Am. Medical Informatics Assoc..

[11]  G. Casari,et al.  Automatic extraction of mutations from Medline and cross-validation with OMIM. , 2004, Nucleic acids research.

[12]  Mark R. Gilder,et al.  Extraction of protein interaction information from unstructured text using a context-free grammar , 2003, Bioinform..