An Integrative Genomic Approach to Uncover Molecular Mechanisms of Prokaryotic Traits

With the mounting availability of genomic and phenotypic databases, data integration and mining have become increasingly challenging. While efforts have been made to analyze prokaryotic phenotypes, current computational technologies either lack the high-throughput capacity needed for genome-scale analysis or are limited in their ability to integrate and mine data across different scales of biology. Consequently, simultaneous analysis of associations among genomes, phenotypes, and gene functions has not been feasible. Here, we developed a high-throughput computational approach and demonstrated, for the first time, the feasibility of integrating large quantities of prokaryotic phenotypes with genomic datasets for mining across multiple scales of biology (protein domains, pathways, molecular functions, and cellular processes). Applying this method to 59 fully sequenced prokaryotic species, we identified the genetic basis and molecular mechanisms underlying the phenotypes in bacteria. We identified 3,711 significant correlations between 1,499 distinct Pfam entries and 63 phenotypes, comprising 2,650 correlations and 1,061 anti-correlations. Manual evaluation of a random sample of these significant correlations showed a minimal precision of 30% (95% confidence interval: 20%–42%; n = 50). We stratified the 478 most significant predictions and subjected 100 of them to manual evaluation, of which 60 were corroborated in the literature. We furthermore unveiled 10 significant correlations between phenotypes and KEGG pathways, eight of which were corroborated in the evaluation, and 309 significant correlations between phenotypes and 166 GO concepts, which were evaluated using a random sample (minimal precision = 72%; 95% confidence interval: 60%–80%; n = 50). Additionally, we conducted a novel large-scale phenomic visualization analysis to provide insight into the modular nature of common molecular mechanisms that span multiple biological scales and are reused by related phenotypes (metaphenotypes).
We propose that this method elucidates which classes of molecular mechanisms are associated with phenotypes or metaphenotypes and holds promise in facilitating a computable systems biology approach to genomic and biomedical research.
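
As an illustration of how a precision estimate from a manually evaluated sample of n = 50 can be paired with a 95% confidence interval, the sketch below computes a Wilson score interval in Python. The abstract does not state which interval method was used, so the Wilson method here is an assumption, and the function name `wilson_interval` is hypothetical. The counts 15 of 50 and 36 of 50 are inferred from the reported 30% and 72% precision values. The resulting intervals are close to, but not identical with, the reported 20%–42% and 60%–80%, so this should be read as a sketch rather than a reproduction of the original calculation.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Return the (lower, upper) Wilson score interval for a proportion.

    Assumption: the source abstract does not name its interval method;
    the Wilson score interval is used here only as a common choice.
    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


if __name__ == "__main__":
    # Counts inferred from the reported precisions with n = 50 (assumption).
    for label, hits in [("Pfam sample", 15), ("GO sample", 36)]:
        low, high = wilson_interval(hits, 50)
        print(f"{label}: precision {hits / 50:.0%}, 95% CI {low:.1%} to {high:.1%}")
```

With only 50 evaluated items, the interval spans more than 20 percentage points in either case, which is why the abstract reports the precision figures together with their confidence intervals.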
