Point estimates in phylogenetic reconstructions

Motivation: The construction of statistics for summarizing the posterior samples returned by a Bayesian phylogenetic study has so far been hindered by the limited geometric insight available into the space of phylogenetic trees. Ad hoc methods such as the derivation of a consensus tree make up for the ill-defined notion of a posterior mean, while bootstrap methods mitigate the absence of a sound concept of variance. Although such methods yield satisfactory results for sufficiently concentrated posterior distributions, they fall short of providing a faithful summary when the data do not offer compelling evidence for a single topology.

Results: Building upon previous work of Billera et al., summary statistics such as the sample mean, median and variance are defined as the Fréchet mean, geometric median and Fréchet variance, respectively. Their computation is enabled by recently published work and embeds an algorithm for computing shortest paths in the space of trees. In a study of the phylogeny of a set of plants, where several tree topologies occur in the posterior sample, the posterior mean correctly balances the contributions from the different topologies, whereas a consensus tree would be biased. Comparisons of the posterior mean, median and consensus trees with the ground truth on simulated data also reveal the benefits of a sound averaging method when reconstructing phylogenetic trees.

Availability and implementation: We provide two independent implementations of the algorithm for computing Fréchet means, geometric medians and variances in the space of phylogenetic trees. TFBayes: https://github.com/pbenner/tfbayes, TrAP: https://github.com/bacak/TrAP.

Contact: philipp.benner@mis.mpg.de
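To make the notions of Fréchet mean and Fréchet variance concrete, the following is a minimal sketch and not the TFBayes or TrAP implementation. It assumes a metric space in which any two points are joined by a unique geodesic, and uses the Euclidean plane as a stand-in for tree space. The names `frechet_mean`, `frechet_variance`, `euclid_geodesic` and `euclid_dist` are hypothetical. The mean is approximated by an inductive averaging scheme that repeatedly moves the current estimate a fraction 1/k along the geodesic towards the next sample; for phylogenetic trees the geodesic would have to come from a shortest-path algorithm in tree space, which is not shown here.

```python
import math
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def euclid_dist(a: Point, b: Point) -> float:
    """Euclidean distance (assumed stand-in for the tree-space metric)."""
    return math.dist(a, b)


def euclid_geodesic(a: Point, b: Point, t: float) -> Point:
    """Point at fraction t along the straight segment from a to b."""
    return tuple((1.0 - t) * x + t * y for x, y in zip(a, b))


def frechet_mean(
    samples: Sequence[Point],
    geodesic: Callable[[Point, Point, float], Point] = euclid_geodesic,
    sweeps: int = 50,
) -> Point:
    """Approximate the minimizer of the sum of squared distances.

    Inductive averaging: start at the first sample, then cycle through
    the samples and move the estimate by 1/k along the geodesic towards
    the k-th visited sample.
    """
    mean = samples[0]
    k = 1
    for sweep in range(sweeps):
        for i, y in enumerate(samples):
            if sweep == 0 and i == 0:
                continue  # the first sample is the starting point
            k += 1
            mean = geodesic(mean, y, 1.0 / k)
    return mean


def frechet_variance(
    samples: Sequence[Point],
    center: Point,
    dist: Callable[[Point, Point], float] = euclid_dist,
) -> float:
    """Average squared distance from the samples to the given center."""
    return sum(dist(center, y) ** 2 for y in samples) / len(samples)


samples = [(0.0, 0.0), (2.0, 0.0), (4.0, 6.0)]
mean = frechet_mean(samples)
variance = frechet_variance(samples, mean)
```

In the Euclidean case the iteration reduces to a running arithmetic average, so the result coincides with the ordinary sample mean; the point of the sketch is that only `geodesic` and `dist` need to be replaced to work in another geodesic space.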

[1] H. Philippe et al. A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. 2004, Molecular biology and evolution.

[2] Charles F. Delwiche et al. The Closest Living Relatives of Land Plants. 2001, Science.

[3] David Bryant et al. A classification of consensus methods for phylogenetics. 2001, Bioconsensus.

[4] A. Robertson et al. The evolution of DNA sequences. 1986, Philosophical transactions of the Royal Society of London. Series B, Biological sciences.

[5] Christian P. Robert et al. The Bayesian choice. 1994.

[6] Z. Yang et al. Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. 1993, Molecular biology and evolution.

[7] Katharina T. Huber et al. Basic Phylogenetic Combinatorics. 2011.

[8] Z. Yang et al. Among-site rate variation and its impact on phylogenetic analyses. 1996, Trends in ecology & evolution.

[9] Christian P. Robert et al. Monte Carlo Statistical Methods. 2005, Springer Texts in Statistics.

[10] M. Bacák. Convex Analysis and Optimization in Hadamard Spaces. 2014.

[11] Miroslav Bacák et al. Computing Medians and Means in Hadamard Spaces. 2012, SIAM J. Optim.

[12] W. K. Hastings et al. Monte Carlo Sampling Methods Using Markov Chains and Their Applications. 1970.

[13] John P. Huelsenbeck et al. MRBAYES: Bayesian inference of phylogenetic trees. 2001, Bioinform.

[14] F. Sanger et al. Sequence and organization of the human mitochondrial genome. 1981, Nature.

[15] A. Rambaut et al. BEAST: Bayesian evolutionary analysis by sampling trees. 2007, BMC Evolutionary Biology.

[16] Ezra Miller et al. Averaging metric phylogenetic trees. 2012, ArXiv.

[17] J. Liu et al. Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. 2001, Nucleic acids research.

[18] Louis J. Billera et al. Geometry of the Space of Phylogenetic Trees. 2001, Adv. Appl. Math.

[19] O. Gascuel et al. A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. 2003, Systematic biology.

[20] Wenbin Li et al. Bayes estimators for phylogenetic reconstruction. 2009, Systematic biology.

[21] Jürgen Jost et al. Nonpositive Curvature: Geometric And Analytic Aspects. 1997.

[22] Olivier Gascuel et al. Mathematics of Evolution and Phylogeny. 2005.

[23] D. Balding et al. Models of sequence evolution for DNA sequences containing gaps. 2001, Molecular biology and evolution.

[25] Tom M. W. Nye et al. Principal components analysis in the space of phylogenetic trees. 2011, 1202.5132.

[26] Fred R. McMorris et al. Consensus n-trees. 1981.

[27] Ziheng Yang et al. Branch-length prior influences Bayesian posterior probability of phylogeny. 2005, Systematic biology.

[28] N. Metropolis et al. Equation of State Calculations by Fast Computing Machines. 1953, Resonance.

[29] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach. 2005, Journal of Molecular Evolution.

[30] Jeet Sukumaran et al. A justification for reporting the majority-rule consensus tree in Bayesian phylogenetics. 2008, Systematic biology.

[31] Karl-Theodor Sturm et al. Probability Measures on Metric Spaces of Nonpositive Curvature. 2003.

[32] J. Scott Provan et al. A Fast Algorithm for Computing Geodesic Distances in Tree Space. 2009, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[33] Christian P. Robert et al. The Bayesian choice: from decision-theoretic foundations to computational implementation. 2007.

[34] Ward C Wheeler et al. Topology-Bayes versus Clade-Bayes in phylogenetic analysis. 2008, Molecular biology and evolution.

[35] Alexandros Stamatakis et al. Novel information theory-based measures for quantifying incongruence among phylogenetic trees. 2014, Molecular biology and evolution.

[36] Karl-Theodor Sturm. Nonlinear martingale theory for processes with values in metric spaces of nonpositive curvature. 2002.

[37] C. Geyer et al. Annealing Markov chain Monte Carlo with applications to ancestral inference. 1995.

[38] Nicolas Lartillot et al. PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. 2009, Bioinform.

[39] Erik van Nimwegen et al. PhyloGibbs: A Gibbs Sampling Motif Finder That Incorporates Phylogeny. 2005, PLoS Comput. Biol.

[40] A. Sandelin et al. Applied bioinformatics for the identification of regulatory elements. 2004, Nature Reviews Genetics.

[41] Gonzalo Giribet et al. Evaluating topological conflict in centipede phylogeny using transcriptomic data sets. 2014, Molecular biology and evolution.