Graph-based sequence annotation using a data integration approach

The automated annotation of data from high-throughput sequencing and genomics experiments is a significant challenge for bioinformatics. Most current approaches rely on sequential pipelines of gene-finding and gene function prediction methods that annotate a gene with information from different reference data sources. Each function prediction method contributes evidence supporting a functional assignment. Such approaches generally ignore the links between the information in the reference datasets. These links, however, are valuable for assessing the plausibility of a function assignment and can be used to evaluate the confidence in a prediction. We are working towards a novel annotation system that uses the network of information supporting a function assignment to enrich the annotation process, both for use by expert curators and for predicting the function of previously unannotated genes. In this paper we describe our success in the first stages of this development. We present the data integration steps needed to create the core database, which combines the reference databases UniProt, Pfam, PDB and GO with the pathway database AraCyc and has been established in the ONDEX data integration system. We also compare different methods for integrating GO terms as part of the function assignment pipeline and discuss the consequences of this analysis for improving the accuracy of gene function annotation. The methods and algorithms presented in this publication are an integral part of the ONDEX system, which is freely available from http://ondex.sf.net/.
