Beyond the data deluge: Data integration and bio-ontologies

Biomedical research is increasingly a data-driven science. New technologies support the generation of genome-scale data sets of sequences, sequence variants, transcripts, and proteins: the genetic elements that underpin our understanding of biomedicine and disease. Information systems designed to manage these data, and the functional insights (biological knowledge) that come from their analysis, are critical to mining large, heterogeneous data sets for new biologically relevant patterns, to generating hypotheses for experimental validation, and ultimately to building models of how biological systems work. Bio-ontologies have an essential role in supporting two key approaches to the effective interpretation of genome-scale data sets: data integration and comparative genomics. To date, bio-ontologies such as the Gene Ontology (GO) have been used primarily in community genome databases as structured controlled terminologies and as data aggregators. In this paper we use GO and the Mouse Genome Informatics (MGI) database as use cases to illustrate the impact of bio-ontologies on data integration and comparative genomics. Despite the profound impact ontologies are having on the digital categorization of biological knowledge, new biomedical research and the expanding and changing nature of biological information have limited the development of bio-ontologies that support dynamic reasoning for knowledge discovery.
