Scaling the walls of discovery: using semantic metadata for integrative problem solving

Bioinformaticians commonly integrate data by writing scripts that extract it from a wide variety of public and private repositories, each with its own vocabulary and schema. These separate data sets must then be normalized through the tedious and lengthy process of resolving naming differences and collecting the information into a single view. Attempts to consolidate such diverse data using data warehouses or federated queries add significant complexity and have shown limited flexibility. The alternative, complete semantic integration of the data, requires a massive, sustained effort to map data types and maintain ontologies. We focused instead on creating a data architecture that leverages semantic mapping of experimental metadata to support the rapid prototyping of scientific discovery applications. Our twin goals were to reduce architectural complexity while still leveraging semantic technologies to provide flexibility, efficiency and more fully characterized data relationships. We first developed a metadata ontology to describe our discovery process. We then created a metadata repository by mapping metadata from existing data sources into this ontology, generating RDF triples to describe the entities. Finally, we designed an interface to the repository that provides not only search and browse capabilities but also complex query templates that aggregate data from both RDF and RDBMS sources. We describe how this approach (i) allows scientists to discover and link relevant data across diverse data sources and (ii) provides a platform for the development of integrative informatics applications.
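
The pipeline described above (map source metadata to triples, then answer a query template by combining the triples with relational data) can be illustrated with a minimal sketch. It is not the authors' implementation: the namespace `EX`, the predicate names, the `expression` table, the record fields and all values are hypothetical, and plain Python tuples and an in-memory SQLite database stand in for a real RDF store and RDBMS.

```python
import sqlite3

# Hypothetical namespace; the real ontology's URIs are not given in the text.
EX = "http://example.org/discovery#"


def row_to_triples(row):
    """Map one metadata record (a dict) to (subject, predicate, object) triples."""
    subject = EX + "experiment/" + row["id"]
    triples = [(subject, EX + "type", EX + "Experiment")]
    for key in ("cell_line", "platform", "source"):
        if row.get(key) is not None:
            triples.append((subject, EX + key, row[key]))
    return triples


def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


def measurements_for_cell_line(triples, conn, cell_line):
    """Query template: use metadata triples to find experiments on a cell
    line, then fetch their measurements from the relational store."""
    subjects = [s for s, _, _ in match(triples, p=EX + "cell_line", o=cell_line)]
    results = {}
    for subject in subjects:
        exp_id = subject.rsplit("/", 1)[1]
        results[exp_id] = conn.execute(
            "SELECT gene, value FROM expression "
            "WHERE experiment_id = ? ORDER BY gene",
            (exp_id,),
        ).fetchall()
    return results


def build_demo():
    """Build a toy metadata repository and relational store (made-up data)."""
    metadata = [
        {"id": "E1", "cell_line": "MCF7", "platform": "arrayA", "source": "repoX"},
        {"id": "E2", "cell_line": "T47D", "platform": "arrayB", "source": "repoY"},
    ]
    triples = [t for row in metadata for t in row_to_triples(row)]
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE expression (experiment_id TEXT, gene TEXT, value REAL)"
    )
    conn.executemany(
        "INSERT INTO expression VALUES (?, ?, ?)",
        [("E1", "AKT1", 2.5), ("E1", "PTEN", 0.8), ("E2", "AKT1", 1.1)],
    )
    return triples, conn


if __name__ == "__main__":
    demo_triples, demo_conn = build_demo()
    print(measurements_for_cell_line(demo_triples, demo_conn, "MCF7"))
```

In this sketch only the metadata is expressed as triples, while the measurements stay in their relational table and are joined at query time through the experiment identifier. That mirrors the trade-off the abstract describes, namely avoiding full semantic integration of the underlying data.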
