A nonparametric method for accommodating and testing across-site rate variation.

Substitution rates are among the most fundamental parameters in a phylogenetic analysis and are represented in phylogenetic models as the branch lengths on a tree. Variation in substitution rates across an alignment of molecular sequences is well established and is likely caused by variation in functional constraint across the genes encoded in the sequences. It is important to accommodate rate variation across alignment sites in a phylogenetic analysis; failure to do so can bias estimates of phylogeny or other model parameters. Traditionally, rate variation across sites has been modeled either by treating the rate for a site as a random variable drawn from some probability distribution (such as the gamma distribution) or by partitioning sites into different rate classes and estimating the rate for each class independently. We consider a different approach, related to site-specific models in which sites are partitioned into rate classes. However, instead of treating the partitioning scheme, in which sites are assigned to rate classes, as a fixed assumption of the analysis, we treat the rate partitioning as a random variable under a Dirichlet process prior. We find that the Dirichlet process prior model for across-site rate variation fits alignments of DNA sequence data better than commonly used models of across-site rate variation. The method appears to identify the underlying codon structure of protein-coding genes: rate partitions sampled by the Markov chain Monte Carlo procedure were closer to a partition in which sites are assigned to rate classes by codon position than to randomly permuted partitions, while still allowing for additional variability across sites.
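To make the idea of a random rate partition concrete, the following minimal sketch draws a partition of sites into rate classes from the Chinese restaurant process, the standard sequential construction of the partition distribution induced by a Dirichlet process. It is an illustration only, not the authors' implementation: the function names and the concentration parameter `alpha` are assumptions introduced here, and the sketch samples from the prior alone, with no likelihood, tree, or MCMC step.

```python
import random


def sample_crp_partition(num_sites, alpha, rng):
    """Draw a partition of sites into rate classes from a Chinese restaurant process.

    num_sites: number of alignment sites (hypothetical input).
    alpha: concentration parameter of the Dirichlet process (assumed name).
    rng: a random.Random instance, passed in so draws are reproducible.
    Returns a list where entry i is the rate-class index of site i.
    """
    assignments = []  # assignments[i] = rate class of site i
    class_sizes = []  # class_sizes[k] = number of sites already in class k
    for _ in range(num_sites):
        # An existing class k is chosen with weight equal to its size,
        # and a brand-new class is chosen with weight alpha.
        weights = class_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(class_sizes):
            class_sizes.append(1)  # open a new rate class
        else:
            class_sizes[k] += 1    # join an existing rate class
        assignments.append(k)
    return assignments


def expected_num_classes(num_sites, alpha):
    """Prior expected number of rate classes: sum over i of alpha / (alpha + i)."""
    return sum(alpha / (alpha + i) for i in range(num_sites))


if __name__ == "__main__":
    rng = random.Random(1)
    partition = sample_crp_partition(30, alpha=1.0, rng=rng)
    print("rate-class assignment per site:", partition)
    print("number of rate classes drawn:", len(set(partition)))
    print("prior expected number of classes:", expected_num_classes(30, 1.0))
```

Small values of `alpha` favor a few large rate classes, while large values favor many small ones, so the number of classes is not fixed in advance but is governed by the prior and, in a full analysis, by the data.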
