Semantic Data Integration for Francisella tularensis novicida Proteomic and Genomic Data

This paper summarises the lessons learnt from a case study applying semantic web technologies to the integration of data from the bacterium Francisella tularensis novicida (Fn). Fn data sources are disparate and heterogeneous because multiple laboratories across the world, using multiple technologies, perform experiments to understand the mechanism of virulence. Such data are hard to integrate, and this work examines the role of explicitly provided data semantics in that integration. We tested whether semantic web technologies could reveal previously unknown connections across the available Fn datasets. We combined these datasets with genome data and with public domain annotations from GO, KEGG and the SUPERFAMILY database. Using the resulting connected graph of database cross-references, we extended the annotations of an experimental data set by superimposing the annotation graph onto it. Identifiers used in the experimental data resolved automatically, and the data acquired annotations from the rest of the RDF graph, without the expensive manual annotation that would normally be required to produce these links. Other lessons learnt and the future challenges that arise from this work are also presented in detail.
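
The way shared identifiers let an experimental data set acquire annotations can be illustrated with a minimal sketch. It is not the authors' implementation: it models RDF triples as plain Python tuples rather than using an RDF store, and every identifier, predicate name and annotation value below is a hypothetical placeholder, not real Fn data. Merging the graphs is a set union, so a gene identifier that appears in both the experimental data and the genome data becomes a single node, and annotations attached further along the cross-reference chain become reachable from it.

```python
# Toy model of RDF-style integration: a graph is a set of
# (subject, predicate, object) triples. All names are hypothetical.

def merge_graphs(*graphs):
    """Return the union of several triple sets.

    Nodes that share an identifier across graphs are merged implicitly,
    because identical strings denote the same node.
    """
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def annotations_for(graph, subject, link_predicates):
    """Collect (predicate, object) pairs reachable from `subject`.

    Predicates listed in `link_predicates` are treated as cross-references
    and followed; all other predicates are reported as annotations.
    """
    found = set()
    seen = set()
    stack = [subject]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in graph:
            if s != node:
                continue
            if p in link_predicates:
                stack.append(o)  # follow the cross-reference
            else:
                found.add((p, o))
    return found


# Hypothetical experimental result keyed by a gene identifier.
experimental = {
    ("gene:EXAMPLE_0001", "ex:observedIn", "experiment:run_1"),
}
# Hypothetical genome data linking the gene to a protein record.
genome = {
    ("gene:EXAMPLE_0001", "ex:encodes", "protein:EXAMPLE_P1"),
}
# Hypothetical public annotations attached to the protein record.
public_annotations = {
    ("protein:EXAMPLE_P1", "ex:hasGOTerm", "GO:EXAMPLE_TERM"),
    ("protein:EXAMPLE_P1", "ex:inPathway", "KEGG:EXAMPLE_PATHWAY"),
}

LINKS = {"ex:encodes"}

before = annotations_for(experimental, "gene:EXAMPLE_0001", LINKS)
merged = merge_graphs(experimental, genome, public_annotations)
after = annotations_for(merged, "gene:EXAMPLE_0001", LINKS)
```

Here `before` holds only the experimental observation, while `after` also contains the placeholder GO and KEGG annotations, although no link between the experiment and those annotations was ever written by hand.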
