Parallel Likelihood Calculation for Phylogenetic Comparative Models: the SPLITT C++ Library

Phylogenetic comparative methods have been used to model trait evolution, to test selection versus neutral hypotheses, to estimate optimal trait values, and to quantify the rate of adaptation towards these optima. Several authors have proposed algorithms that calculate the likelihood for trait evolution models, such as the Ornstein-Uhlenbeck (OU) process, in time proportional to the number of tips in the tree. Combined with gradient-based optimization, these algorithms enable maximum likelihood (ML) inference within seconds, even for trees exceeding 10,000 tips. Despite its useful statistical properties, ML has been criticised for being a point estimator that is prone to getting stuck in local optima. As an elegant alternative, Bayesian inference explores the full information in the data and compares it with prior knowledge, but it usually takes much longer, even on small trees. Here, we propose an approach that uses the full potential of ML and Bayesian inference while keeping the runtime within minutes. Our approach combines (i) a new algorithm for parallel traversal of the lineages in the tree, enabling parallel calculation of the likelihood, and (ii) a previously published method for adaptive Metropolis sampling. In principle, the combination of (i) and (ii) can be applied to any likelihood calculation on a tree that proceeds in a pruning-like fashion, leading to large speed improvements. We implement several variants of the parallel algorithm in the form of a generic C++ library, "SPLiTTree", which automatically chooses the optimal algorithm for a given task and computing platform. We give examples of models of discrete and continuous trait evolution that are amenable to parallel likelihood calculation. As a complete showcase, we implement the phylogenetic Ornstein-Uhlenbeck mixed model (POUMM) in the form of an easy-to-use and highly configurable R package that calls the library as a back-end.
In addition to the uses of comparative methods mentioned above, POUMM allows estimation of the non-heritable variance and the phylogenetic heritability. Using SPLiTTree, calculating the POUMM likelihood on a 4-core SIMD-enabled processor is up to 10 times faster than serial implementations written in C and hundreds of times faster than serial implementations written in R. By combining SPLiTTree likelihood calculation with adaptive Metropolis sampling, the time for Bayesian POUMM inference on a tree of ten thousand tips is reduced from several days to a few minutes.
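To make the idea of a parallel, pruning-like traversal concrete, the following is a minimal sketch and not the library's actual API or algorithm. It assumes a hypothetical tree encoding (`parent[i]` is the parent index of node `i`, `-1` for the root) and a simple level-synchronous scheme: nodes are grouped by subtree height, and all nodes in one group are processed in parallel because they depend only on groups already finished. The names `LevelOrder`, `BuildLevels` and `SubtreeLengths` are illustrative, and a toy quantity (the total branch length below each node) stands in for the per-node likelihood terms of a real model.

```cpp
#include <cassert>
#include <vector>

// Hypothetical tree encoding (an assumption, not the library's data structure):
// parent[i] is the index of node i's parent, or -1 if i is the root.
struct LevelOrder {
  std::vector<std::vector<int>> children;  // children[i]: child indices of node i
  std::vector<std::vector<int>> levels;    // levels[h]: nodes with subtree height h
};

// Peel the tree from the tips: level 0 holds the tips, and a node enters the
// next level as soon as its last remaining child has been placed in a level.
LevelOrder BuildLevels(const std::vector<int>& parent) {
  const int n = static_cast<int>(parent.size());
  LevelOrder order;
  order.children.resize(n);
  std::vector<int> pending(n, 0);  // number of children not yet placed
  for (int i = 0; i < n; ++i) {
    assert(parent[i] >= -1 && parent[i] < n);
    if (parent[i] >= 0) {
      order.children[parent[i]].push_back(i);
      ++pending[parent[i]];
    }
  }
  std::vector<int> current;
  for (int i = 0; i < n; ++i) {
    if (pending[i] == 0) current.push_back(i);  // tips
  }
  while (!current.empty()) {
    order.levels.push_back(current);
    std::vector<int> next;
    for (int i : current) {
      const int p = parent[i];
      if (p >= 0 && --pending[p] == 0) next.push_back(p);
    }
    current = next;
  }
  return order;
}

// Toy pruning-like accumulation: total[i] is the sum of all branch lengths
// below node i, where length[c] is the length of the branch leading to c.
std::vector<double> SubtreeLengths(const std::vector<int>& parent,
                                   const std::vector<double>& length) {
  assert(parent.size() == length.size());
  const LevelOrder order = BuildLevels(parent);
  std::vector<double> total(parent.size(), 0.0);
  for (const std::vector<int>& level : order.levels) {
    const long m = static_cast<long>(level.size());
    // Nodes within one level are independent: each writes only its own entry
    // and reads only entries of children, which belong to earlier levels.
    // Without OpenMP enabled, the pragma is ignored and the loop runs serially.
    #pragma omp parallel for
    for (long k = 0; k < m; ++k) {
      const int i = level[k];
      double sum = 0.0;
      for (int c : order.children[i]) sum += length[c] + total[c];
      total[i] = sum;
    }
  }
  return total;
}
```

Each node gathers from its children rather than scattering updates to its parent, so no two threads write to the same entry and no locking is needed. How much parallelism this exposes depends on the tree shape: balanced trees give a few wide levels, whereas ladder-like trees give many narrow ones.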
