Modelling the ancestral sequence distribution and model frequencies in context-dependent models for primate non-coding sequences

Background

Recent approaches for context-dependent evolutionary modelling assume that the evolution of a given site depends upon its ancestor and that ancestor's immediate flanking sites. Because such a dependency pattern cannot be imposed on the root sequence, we consider the use of different orders of Markov chains to model dependence at the ancestral root sequence. Root distributions which are coupled to the context-dependent model across the underlying phylogenetic tree are deemed more realistic than decoupled Markov chain models, as the evolutionary process is responsible for shaping the composition of the ancestral root sequence.

Results

We find strong support, in terms of Bayes Factors, for using a second-order Markov chain at the ancestral root sequence along with a context-dependent model throughout the remainder of the phylogenetic tree in an ancestral repeats dataset, and for using a first-order Markov chain at the ancestral root sequence in a pseudogene dataset. Relaxing the assumption of a single context-independent set of independent model frequencies, as presented in previous work, yields a further drastic increase in model fit. We show that the substitution rates associated with the CpG-methylation-deamination process can be modelled through context-dependent model frequencies and that their accuracy depends on the (order of the) Markov chain imposed at the ancestral root sequence. In addition, we provide evidence that this approach (which assumes that root distribution and evolutionary model are decoupled) outperforms an approach inspired by the work of Arndt et al., where the root distribution is coupled to the evolutionary model. We show that the continuous-time approximation of Hwang and Green has stronger support in terms of Bayes Factors, but the parameter estimates show minimal differences.

Conclusions

We show that the combination of a dependency scheme at the ancestral root sequence and a context-dependent evolutionary model across the remainder of the tree allows for accurate estimation of the model's parameters. The different assumptions tested in this manuscript clearly show that designing accurate context-dependent models is a complex process, with many different assumptions that require validation. Further, these assumptions are shown to change across different datasets, making the search for an adequate model for a given dataset quite challenging.
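
To make the idea of imposing Markov chains of different orders on a root sequence concrete, the following is a minimal Python sketch. It is not the authors' method: the abstract describes a Bayesian analysis compared through Bayes Factors, whereas this sketch uses a simple plug-in estimate with pseudocounts, and the function name `markov_log_likelihood`, the `pseudocount` parameter, and the uniform treatment of the first `order` sites are all assumptions made for illustration.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def markov_log_likelihood(seq, order, pseudocount=1.0):
    """Plug-in log-likelihood of `seq` under an order-`order` Markov chain.

    Transition probabilities are estimated from the sequence itself with
    additive smoothing (an illustrative assumption, not a Bayesian fit).
    """
    # The first `order` sites have no full left context; here they are
    # simply given a uniform probability (an assumption of this sketch).
    total = min(order, len(seq)) * math.log(1.0 / len(ALPHABET))

    context_counts = Counter()     # how often each context occurs
    transition_counts = Counter()  # how often each (context, base) occurs
    for i in range(order, len(seq)):
        context = seq[i - order:i]  # empty string when order == 0
        context_counts[context] += 1
        transition_counts[(context, seq[i])] += 1

    # Sum n * log(p) over every observed (context, base) pair.
    for (context, _base), n in transition_counts.items():
        p = (n + pseudocount) / (
            context_counts[context] + pseudocount * len(ALPHABET)
        )
        total += n * math.log(p)
    return total


if __name__ == "__main__":
    toy_root = "ACACACAC"  # hypothetical toy sequence, not real data
    for k in (0, 1, 2):
        print(k, markov_log_likelihood(toy_root, k))
```

On a strongly alternating toy sequence such as this one, the first-order chain gives a higher plug-in log-likelihood than the zero-order chain. That comparison ignores model complexity, so it should not be read as a Bayes Factor; it only shows how the order of the chain changes which dependencies the root distribution can capture.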

[1]  Christopher B. Burge, et al. DNA Sequence Evolution with Neighbor-Dependent Mutation, 2003, J. Comput. Biol.

[2]  Eric D Green, et al. Differences between pair-wise and multi-sequence alignment methods affect vertebrate genome comparisons, 2006, Trends in genetics: TIG.

[3]  J. Schafer, et al. Analysis of Incomplete Multivariate Data (Monographs on Statistics and Applied Probability, No. 72), 2000.

[4]  D. Penny. Inferring Phylogenies.—Joseph Felsenstein. 2003. Sinauer Associates, Sunderland, Massachusetts, 2004.

[5]  J. L. Jensen, et al. Probabilistic models of DNA sequence evolution with context dependent rates of substitution, 2000, Advances in Applied Probability.

[6]  H. Philippe, et al. Assessing site-interdependent phylogenetic models of sequence evolution, 2006, Molecular biology and evolution.

[7]  C. H. Edwards, et al. Calculus with analytic geometry, 1994.

[8]  ben-Avraham, et al. Mean-field (n,m)-cluster approximation for lattice models, 1992, Physical review. A, Atomic, molecular, and optical physics.

[9]  Wanjun Gu, et al. Rapid likelihood analysis on large phylogenies using partial sampling of substitution histories, 2010, Molecular biology and evolution.

[10]  Ziheng Yang. Estimating the pattern of nucleotide substitution, 1994, Journal of Molecular Evolution.

[11]  K. J. Fryxell, et al. Cytosine deamination plays a primary role in the evolution of mammalian isochores, 2000, Molecular biology and evolution.

[12]  David T. Jones, et al. Protein evolution with dependence among codons due to tertiary structure, 2003, Molecular biology and evolution.

[13]  Jonathan P. Bollback, et al. Bayesian Inference of Phylogeny and Its Impact on Evolutionary Biology, 2001, Science.

[14]  Mark Holder, et al. Model parameterization, prior distributions, and the general time-reversible model in Bayesian phylogenetics, 2004, Systematic biology.

[15]  Mike Steel, et al. Should phylogenetic models be trying to "fit an elephant"?, 2005, Trends in genetics: TIG.

[16]  G. Serio, et al. A new method for calculating evolutionary substitution rates, 2005, Journal of Molecular Evolution.

[17]  B. Blaisdell, et al. Markov chain analysis finds a significant influence of neighboring bases on the occurrence of a base in eucaryotic nuclear DNA sequences both protein-coding and noncoding, 1985, Journal of Molecular Evolution.

[18]  Michael T. Clegg, et al. Neighboring base composition is strongly correlated with base substitution bias in a region of the chloroplast genome, 1995, Journal of Molecular Evolution.

[19]  P. Green, et al. Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution, 2004, Proceedings of the National Academy of Sciences of the United States of America.

[20]  H. Philippe, et al. Computing Bayes factors using thermodynamic integration, 2006, Systematic biology.

[21]  Sylvia Richardson, et al. Markov Chain Monte Carlo in Practice, 1997.

[22]  W. Salser. Globin mRNA sequences: analysis of base pairing and evolutionary implications, 1978, Cold Spring Harbor symposia on quantitative biology.

[23]  B. Morton, et al. Neighboring base composition and transversion/transition bias in a comparison of rice and maize chloroplast noncoding regions, 1995, Proceedings of the National Academy of Sciences of the United States of America.

[24]  G. G. Altman, et al. A search for patterns in the nucleotide sequence of the MS2 genome, 1979.

[25]  D. Haussler, et al. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood, 2003, Molecular biology and evolution.

[26]  M. Clegg, et al. The Influence of Specific Neighboring Bases on Substitution Bias in Noncoding Regions of the Plant Chloroplast Genome, 1997, Journal of Molecular Evolution.

[27]  W. Wong, et al. The calculation of posterior distributions by data augmentation, 1987.

[28]  P. Green, et al. Transcription-associated mutational asymmetry in mammalian evolution, 2003, Nature Genetics.

[29]  B. Morton. The Influence of Neighboring Base Composition on Substitutions in Plant Chloroplast Coding Sequences, 1997.

[30]  Ziheng Yang, et al. Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage, 2008, Molecular biology and evolution.

[31]  Guy Baele, et al. A model-based approach to study nearest-neighbor influences reveals complex substitution patterns in non-coding sequences, 2008, Systematic biology.

[32]  Guy Baele, et al. Efficient context-dependent model building based on clustering posterior distributions for non-coding sequences, 2009, BMC Evolutionary Biology.

[33]  Hervé Philippe, et al. Mutation-selection models of coding sequence evolution with site-heterogeneous amino acid fitness profiles, 2010, Proceedings of the National Academy of Sciences.