Software for Detecting Heterogeneous Evolutionary Processes across Aligned Sequence Data

Abstract: Most model-based molecular phylogenetic methods assume that the sequences diverged on a tree under homogeneous conditions. If evolution occurred under these conditions, then it is unlikely that the sequences would become compositionally heterogeneous. Conversely, if the sequences are compositionally heterogeneous, then it is unlikely that they have evolved under homogeneous conditions. We present methods to detect and analyse heterogeneous evolution in aligned sequence data and to examine, both visually and numerically, its effect on phylogenetic estimates. The methods are implemented in three programs, allowing users to examine more closely the conditions under which their phylogenetic data may have evolved.
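To make the idea of detecting heterogeneity in a pair of aligned sequences concrete, the following is a minimal sketch of one classical matched-pairs check, Bowker's test of symmetry, applied to the divergence matrix of two aligned nucleotide sequences. It is an illustration only: the abstract does not state which tests the three programs implement, and the function name `bowker_symmetry`, its parameters, and the example sequences are hypothetical.

```python
from itertools import combinations


def bowker_symmetry(seq_a, seq_b, alphabet="ACGT"):
    """Return (statistic, degrees_of_freedom) for Bowker's test of symmetry.

    Hypothetical helper for illustration; not taken from the programs
    described in the abstract.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned (equal length)")

    index = {ch: i for i, ch in enumerate(alphabet)}
    k = len(alphabet)

    # Divergence matrix: counts[i][j] is the number of sites with
    # state i in the first sequence and state j in the second.
    counts = [[0] * k for _ in range(k)]
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in index and b in index:  # skip gaps and ambiguous characters
            counts[index[a]][index[b]] += 1

    # Sum (n_ij - n_ji)^2 / (n_ij + n_ji) over off-diagonal pairs i < j,
    # ignoring pairs whose combined count is zero.
    stat = 0.0
    df = 0
    for i, j in combinations(range(k), 2):
        total = counts[i][j] + counts[j][i]
        if total > 0:
            stat += (counts[i][j] - counts[j][i]) ** 2 / total
            df += 1
    return stat, df


if __name__ == "__main__":
    # Toy aligned pair (made-up data, for demonstration only).
    stat, df = bowker_symmetry("AAAAAACC", "CCCCAAAA")
    print(stat, df)
```

Under the null hypothesis of symmetry, the statistic is compared with a chi-square distribution on the returned degrees of freedom; a small p-value suggests that the pair of sequences did not evolve under the same conditions. In this sketch, dropping empty off-diagonal pairs from the degrees of freedom is one common convention and is an assumption here.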
