An empirical examination of the utility of codon-substitution models in phylogeny reconstruction.

Models of codon substitution have been commonly used to compare protein-coding DNA sequences and are particularly effective in detecting signals of natural selection acting on the protein. Their utility in reconstructing molecular phylogenies and in dating species divergences has not been explored. Codon models naturally accommodate synonymous and nonsynonymous substitutions, which occur at very different rates and may be informative for recent and ancient divergences, respectively. Thus codon models may be expected to make efficient use of the phylogenetic information in protein-coding DNA sequences. Here we applied codon models to 106 protein-coding genes from eight yeast species to reconstruct phylogenies using the maximum likelihood method, in comparison with nucleotide- and amino acid-based analyses. The results appeared to confirm that expectation. Nucleotide-based analyses, under simplistic substitution models, were efficient in recovering recent divergences, whereas amino acid-based analyses performed better at recovering deep divergences. Codon models appeared to combine the advantages of amino acid and nucleotide data and performed well at recovering both recent and deep divergences. Estimation of relative species divergence times using amino acid and codon models suggested that translation of gene sequences into proteins led to an information loss ranging from 30% for deep nodes to 66% for recent nodes. Although the computational burden makes codon models infeasible for tree search in large data sets, we suggest that they may be useful for comparing candidate trees. Nucleotide models that accommodate the differences in evolutionary dynamics at the three codon positions also performed well, at much less computational cost. We discuss the relationship between a model's fit to data and its utility in phylogeny reconstruction and caution against the use of overly complex substitution models.
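The distinction between synonymous and nonsynonymous substitutions that codon models build on can be illustrated with a minimal sketch. It is not the likelihood model used in the study; it only classifies single-nucleotide codon changes. The standard genetic code is assumed here (a data set containing species with a variant code would need a different table), and the function names `classify` and `neighbor_counts` are hypothetical.

```python
# Illustrative sketch only: classify single-nucleotide codon changes as
# synonymous or nonsynonymous. Assumes the standard genetic code.

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order; '*' marks a stop codon.
AMINO = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
GENETIC_CODE = dict(zip(CODONS, AMINO))


def classify(codon_from, codon_to):
    """Return 'synonymous' or 'nonsynonymous' for a one-nucleotide change."""
    n_diff = sum(x != y for x, y in zip(codon_from, codon_to))
    if n_diff != 1:
        raise ValueError("codons must differ at exactly one position")
    aa_from = GENETIC_CODE[codon_from]
    aa_to = GENETIC_CODE[codon_to]
    if "*" in (aa_from, aa_to):
        raise ValueError("changes to or from stop codons are not classified")
    return "synonymous" if aa_from == aa_to else "nonsynonymous"


def neighbor_counts(codon):
    """Count (synonymous, nonsynonymous) one-step neighbors, skipping stops."""
    syn = nonsyn = 0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            neighbor = codon[:pos] + base + codon[pos + 1:]
            if GENETIC_CODE[neighbor] == "*":
                continue  # stop codons are excluded from both counts
            if classify(codon, neighbor) == "synonymous":
                syn += 1
            else:
                nonsyn += 1
    return syn, nonsyn


if __name__ == "__main__":
    print(classify("TTT", "TTC"))   # Phe -> Phe
    print(classify("TTT", "TTA"))   # Phe -> Leu
    print(neighbor_counts("CTT"))   # a Leu codon
    print(neighbor_counts("ATG"))   # Met, encoded by a single codon
```

In this sketch most third-position changes are synonymous while most first- and second-position changes are not, which is one way to see why nucleotide models that treat the three codon positions separately can capture part of the same signal.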
