Choosing among Partition Models in Bayesian Phylogenetics

Bayesian phylogenetic analyses often rely on Bayes factors (BFs) to determine the best way to partition the data. The marginal likelihoods used to compute BFs are, in turn, most commonly estimated using the harmonic mean (HM) method, which has been shown to be inaccurate. We describe a new, more accurate method for estimating the marginal likelihood of a model and compare it with the HM method on both simulated and empirical data. The new method generalizes our previously described stepping-stone (SS) approach by making use of a reference distribution parameterized using samples from the posterior distribution. This avoids one challenging aspect of the original SS method, namely the need to sample from distributions that are close (in the Kullback–Leibler sense) to the prior. We specifically address the choice of partition models and find that using the HM method can lead to a strong preference for an overpartitioned model. Using simulated data, we show that the generalized SS method is strikingly more precise than both the HM method and the original SS method (that is, it yields more repeatable BF values for the same data and partition model) and that its BF values are much more reasonable than those produced by the HM method. Comparisons of the HM and generalized SS methods on an empirical data set demonstrate that the generalized SS method tends to choose simpler partition schemes that are more in line with expectations based on inferred patterns of molecular evolution. Like thermodynamic integration, the generalized SS method requires sampling from a series of distributions in addition to the posterior. Such dedicated path-based Markov chain Monte Carlo analyses appear to be a cost of estimating marginal likelihoods accurately.
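To make the contrast between the two estimators concrete, the following is a minimal sketch on a toy problem, not the phylogenetic implementation described above. It assumes a one-parameter normal model with known unit variance and a vague normal prior, chosen because its marginal likelihood has a closed form and because every distribution along the path between the reference and the posterior is itself normal, so each can be sampled exactly instead of by MCMC. The data, the prior width `tau`, the number of steps, and all function names are illustrative assumptions.

```python
import numpy as np

def log_mean_exp(x):
    """Numerically stable log of the mean of exp(x)."""
    m = np.max(x)
    return m + np.log(np.mean(np.exp(x - m)))

def norm_logpdf(x, mean, sd):
    return -0.5 * np.log(2 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2

rng = np.random.default_rng(1)
n, tau = 20, 10.0                      # toy data size and prior sd (assumed)
y = rng.normal(0.5, 1.0, size=n)       # y_i ~ N(theta, 1), theta ~ N(0, tau^2)

def loglik(theta):
    # Log likelihood of all data for each value in the vector theta
    return norm_logpdf(y[None, :], theta[:, None], 1.0).sum(axis=1)

def logprior(theta):
    return norm_logpdf(theta, 0.0, tau)

# Closed-form log marginal likelihood: y ~ N(0, I + tau^2 * 11^T)
exact = (-0.5 * n * np.log(2 * np.pi)
         - 0.5 * np.log(1 + n * tau**2)
         - 0.5 * (np.sum(y**2) - tau**2 * np.sum(y)**2 / (1 + n * tau**2)))

# Exact posterior draws (standing in for an MCMC sample)
post_prec = n + 1 / tau**2
post = rng.normal(y.sum() / post_prec, post_prec**-0.5, size=5000)

# Harmonic mean estimator: 1/c is estimated by the posterior mean of 1/L
log_hm = -log_mean_exp(-loglik(post))

# Reference distribution: a normal parameterized from the posterior sample
ref_mean, ref_sd = post.mean(), post.std(ddof=1)

# Generalized stepping-stone: path q_b proportional to (L * prior)^b * ref^(1-b)
betas = np.linspace(0.0, 1.0, 11)
log_ss = 0.0
for b0, b1 in zip(betas[:-1], betas[1:]):
    # q_b0 is normal here, so draw from it directly (toy shortcut, no MCMC)
    prec = b0 * post_prec + (1 - b0) / ref_sd**2
    mean = (b0 * y.sum() + (1 - b0) * ref_mean / ref_sd**2) / prec
    theta = rng.normal(mean, prec**-0.5, size=2000)
    log_ratio = loglik(theta) + logprior(theta) - norm_logpdf(theta, ref_mean, ref_sd)
    # Each step estimates the ratio of adjacent normalizing constants
    log_ss += log_mean_exp((b1 - b0) * log_ratio)

print("exact log marginal likelihood:", exact)
print("harmonic mean estimate:       ", log_hm)
print("generalized SS estimate:      ", log_ss)
```

Because the reference distribution nearly matches the posterior, the ratio of the unnormalized posterior to the reference is almost constant along the path, which is why the stepping-stone estimate in this sketch should vary little between runs. Changing the seed and rerunning gives a rough feel for the repeatability of each estimator on this toy model.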
