Fast Convergence of MCMC Algorithms for Phylogenetic Reconstruction with Homogeneous Data on Closely Related Species

This paper studies a Markov chain for phylogenetic reconstruction that uses a popular transition between tree topologies known as subtree pruning-and-regrafting (SPR). We analyze the Markov chain in a simplified setting: the generating tree has very short edge lengths, short enough that each sample from the generating tree (a character, in phylogenetic terminology) is likely to have only one mutation, and there are enough samples that the data looks like the generating distribution. We prove that in this setting the Markov chain is rapidly mixing, i.e., it converges quickly to its stationary distribution, which is the posterior distribution over tree topologies. Our proofs use the fact that the leading term of the maximum likelihood function of a tree T is the maximum parsimony score, which is the size of the minimum cut in T needed to realize single-edge cuts of the generating tree. Our main contribution is a combinatorial proof that, in this simplified setting, SPR moves are guaranteed to converge quickly to the maximum parsimony tree. Our results contrast with recent works that give examples with heterogeneous data (namely, data generated from a mixture distribution) where many natural Markov chains are exponentially slow to converge to the stationary distribution.
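
To make the parsimony score concrete, the following is a minimal sketch rather than the paper's own construction. It assumes binary characters, a rooted binary tree written as nested pairs of leaf names, and Fitch's standard algorithm for counting state changes; the names `fitch_score` and `parsimony_score` are hypothetical. For a character produced by a single mutation on one edge of the generating tree, the score on a candidate tree T counts how many edges of T must be cut to separate the two sides of that split.

```python
from typing import Dict, List, Set, Tuple, Union

# Assumed representation: a leaf is a taxon name, an internal node is a pair.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_score(tree: Tree, states: Dict[str, int]) -> int:
    """Minimum number of state changes one character needs on `tree`."""

    def visit(node: Tree) -> Tuple[Set[int], int]:
        if isinstance(node, str):
            # A leaf is fixed to its observed state and costs nothing.
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:
            # The children can agree on a state: no new change here.
            return common, left_cost + right_cost
        # The children disagree: one extra change on an edge below this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def parsimony_score(tree: Tree, characters: List[Dict[str, int]]) -> int:
    """Sum of per-character Fitch scores over the whole data set."""
    return sum(fitch_score(tree, c) for c in characters)


# One character from a single mutation on the internal edge of ((a,b),(c,d)).
split = {"a": 0, "b": 0, "c": 1, "d": 1}
generating_tree = (("a", "b"), ("c", "d"))
other_tree = (("a", "c"), ("b", "d"))

print(fitch_score(generating_tree, split))  # 1: a single edge cut suffices
print(fitch_score(other_tree, split))       # 2: two edges must be cut
```

In this toy case the generating topology realizes the split with one cut while the competing topology needs two, which illustrates why the tree with the lowest total score is the one favored in the abstract's setting.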
