Data integration for plant genomics - exemplars from the integration of Arabidopsis thaliana databases

The development of a systems-based approach to problems in plant sciences requires the integration of existing information resources. However, the available information is often incomplete and dispersed across many sources, and the syntactic and semantic heterogeneity of the data makes integration challenging. In this article, we discuss strategies for data integration and use a graph-based integration method (Ondex) to illustrate some of these challenges with reference to two example problems: the integration of (i) metabolic pathway data and (ii) protein interaction data for Arabidopsis thaliana. In each case, we quantify the degree of overlap among three commonly used information sources. For pathways, we find that the AraCyc database has the widest coverage of enzyme reactions; for protein interactions, we find that the IntAct database provides the largest unique contribution to the integrated dataset. In both examples, however, we observe that a relatively small amount of data is common to all three sources. Analysis and visual exploration of the integrated networks were used to identify a number of practical issues relating to the interpretation of these datasets. We demonstrate the utility of these approaches in two settings: the analysis of groups of coexpressed genes from an individual microarray experiment in the context of pathway information, and the combination of coexpression data with an integrated protein interaction network.
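The overlap quantification mentioned above can be illustrated with a minimal sketch. It assumes each source has already been reduced to a set of comparable identifiers (for example, mapped enzyme reactions or interacting protein pairs), which is the difficult part in practice and is not shown here. The source names, the identifiers and the function `overlap_regions` are hypothetical placeholders; this is not the Ondex implementation and the toy values are not results from the article.

```python
def overlap_regions(sources):
    """Group items by the exact combination of sources that contain them.

    `sources` maps a source name to a set of identifiers. The result maps
    each combination of source names (a frozenset) to the items found in
    exactly those sources, i.e. the regions of a Venn diagram.
    """
    regions = {}
    all_items = set().union(*sources.values())
    for item in all_items:
        key = frozenset(name for name, items in sources.items() if item in items)
        regions.setdefault(key, set()).add(item)
    return regions


# Hypothetical toy identifiers standing in for already-mapped entries.
sources = {
    "source_a": {"r1", "r2", "r3", "r4"},
    "source_b": {"r2", "r3", "r5"},
    "source_c": {"r3", "r4", "r6"},
}

regions = overlap_regions(sources)

# Report each region, from single-source to all-source combinations.
for key in sorted(regions, key=lambda k: (len(k), sorted(k))):
    print(sorted(key), len(regions[key]))
```

The region keyed by all three source names corresponds to the data common to every source, while the single-name regions give each source's unique contribution.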
