Identifying Optimal Models of Evolution

Most phylogenetic methods are model-based: they depend on models of evolution designed to approximate evolutionary processes. Several methods have been developed to identify suitable models of evolution for phylogenetic analysis of alignments of nucleotide or amino acid sequences, and some of these methods are now firmly embedded in the phylogenetic protocol. However, in a disturbingly large number of cases, these models appear to have been used without acknowledgement of their inherent shortcomings. In this chapter, we discuss the problem of model selection and show how some of these shortcomings may be identified and overcome.
