Combining Biological Databases and Text Mining to Support New Bioinformatics Applications

A large amount of biological knowledge today is available only in full-text research papers. Since neither manual database curators nor users can keep up with the rapidly expanding volume of scientific literature, natural language processing (NLP) approaches are becoming increasingly important for bioinformatics projects. In this paper, we go beyond simply extracting information from full-text articles: we describe an architecture that uses the results of text mining research papers to drive targeted access to biological databases, thereby integrating information from both sources within a single biological application. The architecture is currently being used to extract information about protein mutations from full-text research papers. Text mining results drive the retrieval of sequence information from protein databases and the application of algorithmic sequence analysis tools, which in turn facilitate further data access from protein structure databases. A complex mapping of NLP-derived text annotations onto protein structures allows information that is not available in databases of mutation annotations to be rendered with 3D structure visualization.
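
As a rough illustration of the first step in mapping text-mined mutation mentions onto a retrieved protein sequence, the following minimal Python sketch extracts point mutations written in the common wild-type/position/mutant form (for example, N35D) and searches for a numbering offset at which every wild-type residue agrees with the sequence. This is not the implementation described in the paper: the regular expression, the names `Mutation`, `extract_mutations`, and `find_offset`, and the offset-search heuristic are all assumptions made for illustration.

```python
import re
from typing import List, NamedTuple, Optional


class Mutation(NamedTuple):
    """A point mutation as mentioned in text, e.g. N35D (hypothetical type)."""
    wild_type: str
    position: int  # 1-based residue number as written in the paper
    mutant: str


# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed mention format: wild-type letter, residue number, mutant letter.
MUTATION_RE = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b"
)


def extract_mutations(text: str) -> List[Mutation]:
    """Return all point-mutation mentions found in the text, in order."""
    return [
        Mutation(m.group(1), int(m.group(2)), m.group(3))
        for m in MUTATION_RE.finditer(text)
    ]


def find_offset(
    mutations: List[Mutation], sequence: str, max_offset: int = 50
) -> Optional[int]:
    """Find a numbering offset under which every wild-type residue matches.

    Numbering in a paper may differ from database numbering (for example,
    because of a signal peptide), so offsets are tried in order of
    increasing absolute value. Returns None if no offset fits.
    """
    if not mutations:
        return None
    for offset in sorted(range(-max_offset, max_offset + 1), key=abs):
        if all(
            0 <= m.position - 1 + offset < len(sequence)
            and sequence[m.position - 1 + offset] == m.wild_type
            for m in mutations
        ):
            return offset
    return None
```

A consistent offset gives some confidence that the mentions refer to the retrieved sequence. In a fuller pipeline, the grounded positions could then be carried through a sequence alignment onto the residue numbering of a structure file, which is the kind of mapping the abstract refers to.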
