Model Selection in Phylogenetics

Abstract. Investigation into model selection has a long history in the statistical literature. As model-based approaches begin dominating systematic biology, increased attention has focused on how models should be selected for distance-based, likelihood, and Bayesian phylogenetics. Here, we review issues that render model-based approaches necessary, briefly review nucleotide-based models that attempt to capture relevant features of evolutionary processes, and review methods that have been applied to model selection in phylogenetics: likelihood-ratio tests, AIC, BIC, and performance-based approaches.
