Birth-death prior on phylogeny and speed dating

Background: In recent years there has been a trend away from the strict molecular clock when dating speciations and other evolutionary events. Explicit modeling of substitution rates and divergence times makes it possible to formulate informative prior distributions for branch lengths. Models with birth-death priors on tree branching and auto-correlated or iid substitution rates among lineages have been proposed, enabling simultaneous inference of substitution rates and divergence times. This problem has, however, mainly been analysed in the Markov chain Monte Carlo (MCMC) framework, an approach requiring computation times of hours or days when applied to large phylogenies.

Results: We demonstrate that a hill-climbing maximum a posteriori (MAP) adaptation of the MCMC scheme yields a considerable gain in computational efficiency. We also demonstrate that a novel dynamic programming (DP) algorithm for branch length factorization, useful both in the hill-climbing and in the MCMC setting, further reduces computation time. For the problem of inferring rate and time parameters on a fixed tree, we perform simulations, comparisons between hill-climbing and MCMC on a plant rbcL gene dataset, and dating analysis on an animal mtDNA dataset, showing that our methodology enables efficient, highly accurate analysis of very large trees. Datasets requiring a computation time of several days with MCMC can be accurately analysed with our MAP algorithm in less than a minute. From the results of our example analyses, we conclude that our methodology generally avoids getting trapped early in local optima. For the cases where this can nevertheless be a problem, for instance when we infer the tree topology in addition to the parameters, we show that the problem can be avoided by using a simulated-annealing-like (SAL) method in which we favour tree swaps early in the inference and shift our focus towards rate and time parameter changes later on.

Conclusion: Our contribution opens the way for fast and accurate dating analysis of nucleotide sequence data. Modeling branch substitution rates and divergence times separately allows us to include birth-death priors on the times without the assumption of a molecular clock. The methodology is easily adapted to take data from fossil records into account, and it can be used together with a broad range of rate and substitution models.
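
The two search ideas named in the Results, a hill-climbing adaptation of an MCMC scheme and a SAL schedule that favours tree swaps early, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy `log_posterior` (a simple concave surface over one rate and one time), the multiplicative `propose` move, and the linear `swap_probability` schedule with its `p_start` and `p_end` values are all assumptions made purely for illustration.

```python
import math
import random


def log_posterior(params):
    """Hypothetical toy log-posterior over (rate, time); peak at (0.5, 2.0)."""
    rate, time = params
    return -((rate - 0.5) ** 2) - ((time - 2.0) ** 2)


def propose(params, rng):
    """MCMC-style move (assumed): rescale one randomly chosen positive parameter."""
    new = list(params)
    i = rng.randrange(len(new))
    new[i] *= math.exp(rng.gauss(0.0, 0.1))
    return tuple(new)


def hill_climb_map(log_post, proposal, start, n_iter=5000, seed=0):
    """Greedy MAP search: reuse the proposal, but keep only improvements."""
    rng = random.Random(seed)
    current, best = start, log_post(start)
    for _ in range(n_iter):
        candidate = proposal(current, rng)
        score = log_post(candidate)
        if score > best:  # unlike MCMC, a worse state is never accepted
            current, best = candidate, score
    return current, best


def swap_probability(step, n_steps, p_start=0.9, p_end=0.1):
    """Assumed linear SAL-style schedule: chance of proposing a tree swap
    (rather than a rate/time change) falls from p_start to p_end."""
    frac = step / max(1, n_steps - 1)
    return p_start + (p_end - p_start) * frac


# Usage: climb from an arbitrary starting point on the toy surface.
estimate, score = hill_climb_map(log_posterior, propose, (1.0, 1.0))
```

In a full topology search, each iteration would draw a uniform random number, compare it with `swap_probability(step, n_steps)`, and choose between a tree-swap move and a parameter move accordingly; the sketch leaves that loop out because tree moves depend on a tree data structure the abstract does not specify.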
