Facing the Challenges of Data Integration in Biosciences

Data integration in molecular biology and clinical science has become imperative for comprehensive information extraction in systems biology. In this review we evaluate the evolution and characteristics of biological databases and examine existing approaches to data integration in the biosciences. We identify the strengths and weaknesses of these approaches by surveying several successful examples of biological data integration. We point out the challenges faced in biological data integration at various levels, along with possible solutions, and contrast data integration efforts in the biosciences with those in industry.

Index Terms: data integration, federation, warehouse, and
