The importance of proper model assumption in Bayesian phylogenetics.

We studied the importance of proper model assumption in Bayesian phylogenetics by examining more than 5,000 Bayesian analyses and six nested models of nucleotide substitution. Model misspecification can strongly bias bipartition posterior probability estimates. These biases were most pronounced when rate heterogeneity was ignored. The type of bias seen at a particular bipartition appeared to be strongly influenced by the lengths of the branches surrounding that bipartition. In the Felsenstein zone, posterior probability estimates of bipartitions were biased when the assumed model was underparameterized but were unbiased when the assumed model was overparameterized. In the inverse Felsenstein zone, however, both underparameterization and overparameterization led to biased bipartition posterior probabilities, although the bias caused by overparameterization was less pronounced and disappeared with increased sequence length. Model parameter estimates were also affected by model misspecification. Underparameterization biased some parameter estimates, such as branch lengths and the gamma shape parameter, whereas overparameterization reduced the precision of some parameter estimates. We caution researchers to ensure that the most appropriate model is assumed by employing both a priori model choice methods and a posteriori model adequacy tests.
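The branch-length bias caused by ignoring rate heterogeneity can be illustrated in a much simpler setting than the Bayesian analyses summarized above. The following is a minimal sketch, not the study's method: it assumes the Jukes-Cantor (JC69) model for a single pair of sequences, gamma-distributed site rates with mean 1, and an illustrative shape value of 0.5. The function names are hypothetical. It computes the expected proportion of differing sites for a known true distance under JC69 with gamma rates, then estimates the distance back with and without the gamma correction.

```python
import math


def expected_p_under_gamma(d, alpha):
    """Expected proportion of differing sites for true distance d
    (substitutions per site) under JC69 with gamma-distributed site
    rates of shape alpha and mean 1."""
    return 0.75 * (1.0 - (1.0 + 4.0 * d / (3.0 * alpha)) ** (-alpha))


def jc69_distance(p):
    """JC69 distance estimate that assumes every site evolves at the
    same rate (the underparameterized model in this sketch)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def jc69_gamma_distance(p, alpha):
    """JC69 distance estimate that accounts for gamma-distributed
    site rates with known shape alpha."""
    return 0.75 * alpha * ((1.0 - 4.0 * p / 3.0) ** (-1.0 / alpha) - 1.0)


def compare(true_distances, alpha):
    """Return (true d, equal-rates estimate, gamma estimate) tuples."""
    rows = []
    for d in true_distances:
        p = expected_p_under_gamma(d, alpha)
        rows.append((d, jc69_distance(p), jc69_gamma_distance(p, alpha)))
    return rows


if __name__ == "__main__":
    # alpha = 0.5 is an illustrative assumption, not a value from the study.
    for d, naive, corrected in compare([0.1, 0.5, 1.0, 2.0], alpha=0.5):
        print(f"true={d:.2f}  equal-rates={naive:.3f}  gamma={corrected:.3f}")
```

In this toy case the equal-rates estimate falls below the true distance and the gap widens as the distance grows, while the gamma-corrected estimate recovers the true value. This is consistent in direction with the branch-length bias reported for underparameterized models, although it says nothing about bipartition posterior probabilities.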