Fast Dating Using Least-Squares Criteria and Algorithms

Phylogenies provide a useful way to understand the evolutionary history of genetic samples, and data sets with more than a thousand taxa are becoming increasingly common, notably with viruses (e.g., human immunodeficiency virus (HIV)). Dating ancestral events is one of the first, essential goals with such data. However, current sophisticated probabilistic approaches struggle to handle data sets of this size. Here, we present very fast dating algorithms, based on a Gaussian model closely related to the Langley–Fitch molecular-clock model. We show that this model is robust to uncorrelated violations of the molecular clock. Our algorithms apply to serial data, where the tips of the tree have been sampled through time. They estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they can provide an estimate for the root position, thus representing a new, practical alternative to the standard rooting methods (e.g., midpoint). Our algorithms exploit the recursive tree structure of the problem at hand, and the close relationships between least squares and linear algebra. We distinguish between an unconstrained setting and the case where the temporal precedence constraint (i.e., an ancestral node must be older than its daughter nodes) is accounted for. With rooted trees, the former is solved using linear algebra in linear computing time (i.e., proportional to the number of taxa), while the latter, constrained setting is solved with an active-set method that runs in nearly linear time. With unrooted trees, the computing time becomes (nearly) quadratic (i.e., proportional to the square of the number of taxa). In all cases, very large input trees (>10,000 taxa) can easily be processed and transformed into time-scaled trees. We compare these algorithms to standard methods (root-to-tip, r8s version of the Langley–Fitch method, and BEAST). Using simulated data, we show that their estimation accuracy is similar to that of the most sophisticated methods, while their computing time is much shorter. We apply these algorithms to a large data set comprising 1194 strains of influenza virus from the pdm09 H1N1 human pandemic. Again, the results show that these algorithms provide a very fast alternative with results similar to those of other computer programs. These algorithms are implemented in the LSD software (least-squares dating), which can be downloaded from http://www.atgc-montpellier.fr/LSD/, along with all our data sets and detailed results. An Online Appendix, providing additional algorithm descriptions, tables, and figures, can be found in the Supplementary Material available on Dryad at http://dx.doi.org/10.5061/dryad.968t3.
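
To illustrate why the unconstrained setting reduces to linear algebra, here is a minimal sketch, not the LSD implementation. It assumes the criterion is an unweighted sum of squared differences between each branch length and the rate times the time elapsed along that branch (the abstract does not spell out the criterion or its weights). Substituting u = rate × date for every internal node makes the problem linear in the unknowns, so a generic dense solver suffices for a toy tree; this does not reproduce the linear-time recursion described above. The function name, the dictionary-based tree encoding, and the example values are all hypothetical.

```python
import numpy as np

def least_squares_dates(parent, branch_length, tip_date):
    """Estimate the rate and internal node dates on a rooted tree.

    parent[n]        -> parent of node n (the root has no entry)
    branch_length[n] -> length of the branch above n (substitutions/site)
    tip_date[n]      -> sampling date of tip n
    Assumption: unweighted least squares, no precedence constraint.
    """
    internal = sorted(set(parent.values()))
    col = {n: k + 1 for k, n in enumerate(internal)}  # column 0 holds the rate
    # Shift the time origin to the mean tip date for numerical stability.
    t0 = float(np.mean(list(tip_date.values())))
    rows, rhs = [], []
    for node, par in parent.items():
        row = np.zeros(len(internal) + 1)
        if node in tip_date:
            row[0] = tip_date[node] - t0   # rate * (known tip date - t0)
        else:
            row[col[node]] = 1.0           # u_node = rate * (date_node - t0)
        row[col[par]] = -1.0               # minus u_parent
        rows.append(row)
        rhs.append(branch_length[node])
    sol, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    rate = float(sol[0])
    dates = {n: t0 + float(sol[col[n]]) / rate for n in internal}
    return rate, dates

# Toy tree ((tip1, tip2)A, tip3)R with branch lengths generated from a
# strict clock: rate 0.01, root R dated 1990, node A dated 1995.
parent = {"A": "R", "tip1": "A", "tip2": "A", "tip3": "R"}
branch_length = {"A": 0.05, "tip1": 0.05, "tip2": 0.09, "tip3": 0.12}
tip_date = {"tip1": 2000.0, "tip2": 2004.0, "tip3": 2002.0}

rate, dates = least_squares_dates(parent, branch_length, tip_date)
print(rate, dates)
```

Because the toy branch lengths follow a strict clock exactly, the fit should recover the generating rate and dates up to rounding error. Nothing in this sketch enforces the temporal precedence constraint, so on noisy data an estimated ancestral date could fall after that of a descendant.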
