Detection of Biochemical Pathways by Probabilistic Matching of Phyletic Vectors

A phyletic vector, also known as a phyletic (or phylogenetic) pattern, is a binary representation of the presence and absence of orthologous genes in different genomes. Joint occurrence of two or more genes in many genomes results in closely similar binary vectors representing these genes, and this similarity between gene vectors may be used as a measure of functional association between genes. A better understanding of the quantitative properties of gene co-occurrences is needed for systematic studies of gene function and evolution. We used the probabilistic iterative algorithm Psi-square to find groups of similar phyletic vectors. An extended Psi-square algorithm, in which pseudocounts are implemented, shows better sensitivity in identifying proteins with known functional links than our earlier hierarchical clustering approach. At the same time, the specificity of inferring functional associations between genes in prokaryotic genomes depends strongly on the pathway: phyletic vectors of the genes involved in energy metabolism and in de novo biosynthesis of the essential precursors tend to be lumped together, whereas cellular modules involved in secretion, motility, assembly of cell surfaces, biosynthesis of some coenzymes, and utilization of secondary carbon sources tend to be identified with much greater specificity. It appears that the network of gene coinheritance in prokaryotes contains a giant connected component that encompasses most biosynthetic subsystems, along with a series of more independent modules involved in cell interaction with the environment.
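To make the notion of a phyletic vector concrete, the following minimal Python sketch encodes a few genes as binary presence/absence vectors across genomes and ranks gene pairs by similarity. It is an illustration only: the gene names and the toy data are hypothetical, and the Jaccard coefficient is used here as a simple stand-in similarity measure. It is not the Psi-square algorithm described above.

```python
from itertools import combinations

# Hypothetical toy data: each gene maps to a phyletic vector, one entry per
# genome (1 = an ortholog is present in that genome, 0 = absent).
PROFILES = {
    "geneA": (1, 1, 0, 1, 0, 1),
    "geneB": (1, 1, 0, 1, 0, 0),
    "geneC": (0, 0, 1, 0, 1, 0),
}


def jaccard(u, v):
    """Jaccard similarity of two equal-length binary vectors.

    Counts genomes where both genes are present, divided by genomes where
    at least one is present. Returns 0.0 if neither gene occurs anywhere.
    """
    if len(u) != len(v):
        raise ValueError("phyletic vectors must cover the same genomes")
    both = sum(1 for a, b in zip(u, v) if a and b)
    either = sum(1 for a, b in zip(u, v) if a or b)
    return both / either if either else 0.0


def rank_pairs(profiles):
    """Return (similarity, gene1, gene2) tuples, most similar pair first."""
    pairs = [
        (jaccard(profiles[g], profiles[h]), g, h)
        for g, h in combinations(sorted(profiles), 2)
    ]
    return sorted(pairs, reverse=True)


if __name__ == "__main__":
    for score, g, h in rank_pairs(PROFILES):
        print(f"{g} - {h}: {score:.2f}")
```

In this toy example, geneA and geneB co-occur in three of the four genomes where either is present, so they score highest and would be the first candidates for a functional link. A plain pairwise measure like this one does not model the probabilistic, iterative grouping that the abstract attributes to Psi-square.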
