Genome phylogenetic analysis based on extended gene contents.

With the rapid growth of complete genome data, whole-genome approaches such as gene content have become popular for genome phylogeny inference, including inference of the tree of life. However, the underlying model of genome evolution is unclear, and the proposed ad hoc genome distance measures may violate additivity. In this article, we formulate a stochastic framework for genome evolution, which provides a basis for defining an additive genome distance. However, we show that it is difficult to use typical gene content data (i.e., the presence or absence of gene families across genomes) to estimate this genome distance. We solve this problem by introducing the concept of extended gene content: the status of a gene family in a given genome can be absence, presence as a single copy, or presence as duplicates, any of which can be used to estimate the genome distance and to infer phylogeny. Computer simulation shows that the new tree-making method is efficient, consistent, and fairly robust. An example of 35 complete microbial genomes demonstrates that the method is useful not only for studying the universal tree of life but also for exploring the evolutionary patterns of genomes.
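The three-state coding behind extended gene content can be illustrated with a minimal sketch. It assumes each genome is given as a list of copy numbers, one per gene family, in the same family order; the function names, the integer state codes, and the toy data are hypothetical and not taken from the article. The sketch only encodes the states and tabulates how two genomes pair up across families; it does not reproduce the article's stochastic model or its additive distance estimator.

```python
from collections import Counter

# Hypothetical integer codes for the three extended gene content states.
ABSENT, SINGLE, DUPLICATED = 0, 1, 2


def extended_state(copy_number):
    """Map a gene family's copy number in one genome to its extended state."""
    if copy_number < 0:
        raise ValueError("copy number cannot be negative")
    if copy_number == 0:
        return ABSENT
    if copy_number == 1:
        return SINGLE
    return DUPLICATED  # two or more copies


def encode_genome(copy_numbers):
    """Encode a list of per-family copy numbers as extended states."""
    return [extended_state(n) for n in copy_numbers]


def joint_state_counts(copy_numbers_a, copy_numbers_b):
    """Count gene families by the pair of states observed in two genomes.

    The result is a 3x3 contingency table stored as a Counter keyed by
    (state_in_a, state_in_b). Assumption: a table like this is the kind of
    summary a pairwise distance estimator could take as input.
    """
    if len(copy_numbers_a) != len(copy_numbers_b):
        raise ValueError("genomes must cover the same gene families")
    return Counter(zip(encode_genome(copy_numbers_a),
                       encode_genome(copy_numbers_b)))


# Toy data: copy numbers of five gene families in two genomes.
genome_a = [0, 1, 3, 1, 2]
genome_b = [1, 1, 0, 2, 2]

print(encode_genome(genome_a))
print(joint_state_counts(genome_a, genome_b))
```

Conventional gene content would collapse the single-copy and duplicated states into one "present" state, so the table above would shrink from 3x3 to 2x2; keeping the states separate is what the extended coding adds.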
