Heterogeneous biological data integration with declarative query language

Modern biology clearly requires scalable data integration systems, given the very large, heterogeneous, and complex datasets available in public databases. Managing this "big data" and fusing it with local databases is a major challenge, since the integrated data underlie the computational inferences and models that are subsequently generated and validated experimentally. In this paper, we present an alternative design for local data integration, called BIRD (Biological Integration and Retrieval of Data), based on four concepts: (i) a hybrid flat-file and relational database architecture that permits the rapid management of large volumes of heterogeneous datasets; (ii) a generic data model that allows local databases to be organized and classified simultaneously according to real-world requirements; (iii) configuration rules that divide each data resource and map it onto several data model entities; and (iv) a simple, declarative query language (BIRD-QL) that facilitates information extraction from heterogeneous datasets. This flexible, generic design allows diverse data formats to be integrated in a searchable database whose high-level functionalities depend on the specific scientific context. It has been validated in real-world projects, notably the SM2PH (Structural Mutation to the Phenotypes of Human Pathologies) project.
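The four concepts can be illustrated with a minimal Python sketch. It is not the BIRD implementation and does not use BIRD-QL syntax, neither of which is given in this abstract. The flat-file layout, the `RULES` table, the `attr` index table, and the functions `build_index`, `fetch_record`, `select`, and `run_example` are all hypothetical names chosen for illustration. The sketch assumes that full records stay in a flat file, that a relational index (here SQLite) stores only the searchable attributes plus a byte offset, and that a query is stated as a set of conditions rather than as a procedure.

```python
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical flat-file content: tagged lines, records terminated by "//".
FLAT_FILE = """\
ID   P1
GN   BRCA1
OS   Homo sapiens
//
ID   P2
GN   TP53
OS   Homo sapiens
//
ID   P3
GN   Trp53
OS   Mus musculus
//
"""

# Hypothetical configuration rules: line tag -> (data model entity, attribute).
RULES = {
    "ID": ("protein", "accession"),
    "GN": ("gene", "name"),
    "OS": ("organism", "name"),
}


def build_index(path, conn):
    """Scan the flat file once; store mapped attributes and record offsets."""
    conn.execute(
        "CREATE TABLE attr (entity TEXT, attribute TEXT, value TEXT, offset INTEGER)"
    )
    with open(path, "rb") as fh:
        record_start = 0
        while True:
            line = fh.readline()
            if not line:
                break
            text = line.decode().rstrip("\n")
            if text == "//":
                record_start = fh.tell()  # the next record begins here
                continue
            tag, _, value = text.partition("   ")
            if tag in RULES:
                entity, attribute = RULES[tag]
                conn.execute(
                    "INSERT INTO attr VALUES (?, ?, ?, ?)",
                    (entity, attribute, value.strip(), record_start),
                )


def fetch_record(path, offset):
    """Read one full record from the flat file, starting at a byte offset."""
    lines = []
    with open(path, "rb") as fh:
        fh.seek(offset)
        for raw in fh:
            text = raw.decode().rstrip("\n")
            if text == "//":
                break
            lines.append(text)
    return "\n".join(lines)


def select(conn, path, where):
    """Return the records that satisfy every (entity, attribute) == value condition."""
    offsets = None
    for (entity, attribute), value in where.items():
        rows = conn.execute(
            "SELECT offset FROM attr WHERE entity = ? AND attribute = ? AND value = ?",
            (entity, attribute, value),
        )
        found = {row[0] for row in rows}
        offsets = found if offsets is None else offsets & found
    return [fetch_record(path, off) for off in sorted(offsets or ())]


def run_example():
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "proteins.dat"
        path.write_bytes(FLAT_FILE.encode())
        conn = sqlite3.connect(":memory:")
        build_index(path, conn)
        human = select(conn, path, {("organism", "name"): "Homo sapiens"})
        tp53 = select(
            conn,
            path,
            {("organism", "name"): "Homo sapiens", ("gene", "name"): "TP53"},
        )
        conn.close()
        return human, tp53


if __name__ == "__main__":
    human_records, tp53_records = run_example()
    print(len(human_records), "human records")
    print(tp53_records[0])
```

In this sketch the relational side stays small because it holds only the indexed attributes, while the complete records are read from the flat file on demand. Supporting a new data format would mean changing the rule table, not the indexing code.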
