Atlas – a data warehouse for integrative bioinformatics

BackgroundWe present a biological data warehouse called Atlas that locally stores and integrates biological sequences, molecular interactions, homology information, functional annotations of genes, and biological ontologies. The goal of the system is to provide data, as well as a software infrastructure for bioinformatics research and development.DescriptionThe Atlas system is based on relational data models that we developed for each of the source data types. Data stored within these relational models are managed through Structured Query Language (SQL) calls that are implemented in a set of Application Programming Interfaces (APIs). The APIs include three languages: C++, Java, and Perl. The methods in these API libraries are used to construct a set of loader applications, which parse and load the source datasets into the Atlas database, and a set of toolbox applications which facilitate data retrieval. Atlas stores and integrates local instances of GenBank, RefSeq, UniProt, Human Protein Reference Database (HPRD), Biomolecular Interaction Network Database (BIND), Database of Interacting Proteins (DIP), Molecular Interactions Database (MINT), IntAct, NCBI Taxonomy, Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), LocusLink, Entrez Gene and HomoloGene. The retrieval APIs and toolbox applications are critical components that offer end-users flexible, easy, integrated access to this data. We present use cases that use Atlas to integrate these sources for genome annotation, inference of molecular interactions across species, and gene-disease associations.ConclusionThe Atlas biological data warehouse serves as data infrastructure for bioinformatics research and development. It forms the backbone of the research activities in our laboratory and facilitates the integration of disparate, heterogeneous biological sources of data enabling new scientific inferences. Atlas achieves integration of diverse data sets at two levels. First, Atlas stores data of similar types using common data models, enforcing the relationships between data types. Second, integration is achieved through a combination of APIs, ontology, and tools. The Atlas software is freely available under the GNU General Public License at: http://bioinformatics.ubc.ca/atlas/

[1]  Dipanwita Roy Chowdhury,et al.  Human protein reference database as a discovery resource for proteomics , 2004, Nucleic Acids Res..

[2]  Cathy H. Wu,et al.  UniProt: the Universal Protein knowledgebase , 2004, Nucleic Acids Res..

[3]  Y. Wan,et al.  Receptor reserve analysis of the human CCR3 receptor in eosinophils and CCR3‐transfected cells , 2000, Journal of leukocyte biology.

[4]  E. Birney,et al.  EnsMart: a generic system for fast and flexible access to biological data. , 2003, Genome research.

[5]  Tao Xu,et al.  Pegasys: software for executing and integrating analyses of biological sequences , 2004, BMC Bioinformatics.

[6]  Hideaki Sugawara,et al.  DBJ in the stream of various biological data , 2004, Nucleic Acids Res..

[7]  David Botstein,et al.  The Stanford Microarray Database: data access and quality assessment tools , 2003, Nucleic Acids Res..

[8]  H. Nomiyama,et al.  Molecular Cloning of a Novel Human CC Chemokine (Eotaxin-3) That Is a Functional Ligand of CC Chemokine Receptor 3* , 1999, The Journal of Biological Chemistry.

[9]  Gene Ontology Consortium The Gene Ontology (GO) database and informatics resource , 2003 .

[10]  Donna R. Maglott,et al.  RefSeq and LocusLink: NCBI gene-centered resources , 2001, Nucleic Acids Res..

[11]  David L. Wheeler,et al.  GenBank: update , 2004, Nucleic Acids Res..

[12]  Carole A. Goble,et al.  TAMBIS: Transparent Access to Multiple Bioinformatics Information Sources , 1998, ISMB.

[13]  Adam J. Smith,et al.  The Database of Interacting Proteins: 2004 update , 2004, Nucleic Acids Res..

[14]  C. Sander,et al.  The HUPO PSI's Molecular Interaction format—a community standard for the representation of protein interaction data , 2004, Nature Biotechnology.

[15]  M. Ashburner,et al.  Gene Ontology: tool for the unification of biology , 2000, Nature Genetics.

[16]  Terrence S. Furey,et al.  The UCSC Genome Browser Database , 2003, Nucleic Acids Res..

[17]  Carole A. Goble,et al.  An ontology for bioinformatics applications , 1999, Bioinform..

[18]  V. McKusick Mendelian inheritance in man , 1971 .

[19]  Gary D. Bader,et al.  BIND-a data specification for storing and describing biomolecular interactions, molecular complexes and pathways , 2000, Bioinform..

[20]  Ioannis Xenarios,et al.  DIP: The Database of Interacting Proteins: 2001 update , 2001, Nucleic Acids Res..

[21]  Eric S. Lander,et al.  Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[22]  C. Ball,et al.  Microarray databases: standards and ontologies , 2002, Nature Genetics.

[23]  Lei Zhou,et al.  JXP4BIGI: a generalized, Java XML-based approach for biological information gathering and integration , 2003, Bioinform..

[24]  Evelyn Camon,et al.  The EMBL Nucleotide Sequence Database , 2000, Nucleic Acids Res..

[25]  Ian M. Donaldson,et al.  BIND: the Biomolecular Interaction Network Database , 2001, Nucleic Acids Res..

[26]  Mark D. Wilkinson,et al.  BioMOBY: An Open Source Biological Web Services Proposal , 2002, Briefings Bioinform..

[27]  Joshua M. Stuart,et al.  A Gene-Coexpression Network for Global Discovery of Conserved Genetic Modules , 2003, Science.

[28]  Gary D. Bader,et al.  SeqHound: biological sequence and structure database as a platform for bioinformatics research , 2002, BMC Bioinformatics.

[29]  M. Kanehisa,et al.  DBGET/LinkDB: an integrated database retrieval system. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[30]  E. Birney,et al.  Apollo: a sequence annotation editor , 2002, Genome Biology.

[31]  Martin Vingron,et al.  IntAct: an open source molecular interaction database , 2004, Nucleic Acids Res..

[32]  Gregory D. Schuler,et al.  Database resources of the National Center for Biotechnology Information: update , 2004, Nucleic acids research.