Complexity reduction in context-dependent DNA substitution models

MOTIVATION: The modeling of conservation patterns in genomic DNA has become increasingly popular for a number of bioinformatic applications. While several systems developed to date incorporate context-dependence in their substitution models, the impact on computational complexity and generalization ability of the resulting higher order models invites the question of whether simpler approaches to context modeling might permit appreciable reductions in model complexity and computational cost, without sacrificing prediction accuracy.

RESULTS: We formulate several alternative methods for context modeling based on windowed Bayesian networks, and compare their effects on both accuracy and computational complexity for the task of discriminating functionally distinct segments in vertebrate DNA. Our results show that substantial reductions in the complexity of both the model and the associated inference algorithm can be achieved without reducing predictive accuracy.
