Distributed Modules for Text Annotation and IE Applied to the Biomedical Domain

Biological databases contain facts from scientific literature that have been curated by hand to ensure high quality. Curation is time-consuming and can be supported by information extraction methods. We present a server software infrastructure which allows to easily plug in modules to identify biologically interesting pieces of text to be then presented in a web interface to the curator. There are modules which identify UniProt, UMLS and GO terminology, gene and protein names, mutations and protein-protein interactions. UniProt, UMLS and GO concepts are automatically linked to the original source. The module for mutations is based on syntax patterns and the one for protein-protein interactions relies on chunk parsing. All modules work as separate servers possibly distributed on different machines and can be combined into processing pipelines as necessary. Communication is based on XML annotated text streams, each server processing the XML elements it is designed for, and possibly adding more information in the form of XML annotation. The server and the underlying software are available to the public.

[1]  Jeffrey D. Ullman,et al.  Introduction to Automata Theory, Languages and Computation , 1979 .

[2]  Miguel A. Andrade-Navarro,et al.  Information extraction from full text scientific articles: Where are the keywords? , 2003, BMC Bioinformatics.

[3]  Russ B. Altman,et al.  Research Paper: Creating an Online Dictionary of Abbreviations from MEDLINE , 2002, J. Am. Medical Informatics Assoc..

[4]  Hans-Michael Müller,et al.  Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature , 2004, PLoS biology.

[5]  Mark R. Gilder,et al.  Extraction of protein interaction information from unstructured text using a context-free grammar , 2003, Bioinform..

[6]  Daniel Hanisch,et al.  : identifying , 2022 .

[7]  Alexander A. Morgan,et al.  Evaluation of text data mining for database curation: lessons learned from the KDD Challenge Cup , 2003, ISMB.

[8]  Michael Krauthammer,et al.  GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data , 2004, J. Biomed. Informatics.

[9]  G. Casari,et al.  Automatic extraction of mutations from Medline and cross-validation with OMIM. , 2004, Nucleic acids research.

[10]  D. Rebholz-Schuhmann,et al.  Computer-assisted generation of a protein-interaction database for nuclear receptors. , 2003, Molecular endocrinology.

[11]  Hao Yu,et al.  Discovering patterns to extract protein-protein interactions from full texts , 2004, Bioinform..

[12]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[13]  Peter Willett,et al.  Protein Structures and Information Extraction from Biological Texts: The PASTA System , 2003, Bioinform..