Online Bayesian Phylodynamic Inference in BEAST with Application to Epidemic Reconstruction

Abstract: Reconstructing pathogen dynamics from genetic data as they become available during an outbreak or epidemic represents an important statistical scenario in which observations arrive sequentially in time and one is interested in performing inference in an “online” fashion. Widely used Bayesian phylogenetic inference packages are not set up for this purpose, generally requiring one to recompute trees and evolutionary model parameters de novo when new data arrive. To accommodate increasing data flow in a Bayesian phylogenetic framework, we introduce a methodology to efficiently update the posterior distribution with newly available genetic data. Our procedure is implemented in the BEAST 1.10 software package. It relies on a distance-based measure to insert new taxa into the current estimate of the phylogeny, and it imputes plausible values for new model parameters to accommodate growing dimensionality. This augmentation creates informed starting values and re-uses optimally tuned transition kernels for posterior exploration of growing data sets, reducing the time necessary to converge to target posterior distributions. We apply our framework to data from the recent West African Ebola virus epidemic and demonstrate a considerable reduction in the time required to obtain posterior estimates at different time points of the outbreak. Beyond epidemic monitoring, this framework readily applies elsewhere in phylogenetics, where changes in the data (alignment changes, sequence addition, or sequence removal) are common scenarios that can benefit from online inference.
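
The distance-based insertion step mentioned in the abstract can be illustrated with a minimal sketch. This is not the BEAST 1.10 implementation: the function names (`p_distance`, `closest_taxon`, `insert_taxon`), the use of a simple proportion-of-differences distance, the nested-tuple tree representation, and the toy sequences are all assumptions made for illustration. Branch lengths, node heights, and the imputation of new model parameters are deliberately left out.

```python
from typing import Dict, Tuple, Union

# Hypothetical tree representation: a leaf is a taxon name, an internal
# node is a tuple of child subtrees. Branch lengths are ignored here.
Tree = Union[str, Tuple["Tree", ...]]


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites, skipping gaps and ambiguous bases."""
    pairs = [(x, y) for x, y in zip(a, b) if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return float("inf")  # no comparable sites
    return sum(x != y for x, y in pairs) / len(pairs)


def closest_taxon(new_seq: str, alignment: Dict[str, str]) -> str:
    """Name of the existing taxon with the smallest distance to new_seq."""
    return min(alignment, key=lambda name: p_distance(new_seq, alignment[name]))


def insert_taxon(tree: Tree, new_name: str, neighbor: str) -> Tree:
    """Return a copy of tree in which new_name is attached as sister to neighbor."""
    if isinstance(tree, str):
        return (tree, new_name) if tree == neighbor else tree
    return tuple(insert_taxon(child, new_name, neighbor) for child in tree)


# Toy data (assumed, not from the paper): three sampled taxa and one new sequence.
alignment = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA"}
current_tree: Tree = (("A", "B"), "C")
new_sequence = "TTGTACCT"

neighbor = closest_taxon(new_sequence, alignment)
updated_tree = insert_taxon(current_tree, "D", neighbor)
print(neighbor, updated_tree)
```

In an online analysis, a tree augmented this way would serve only as an informed starting state; the sampler would still need to explore the posterior over the enlarged tree and parameter space.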
