Phylogenetic Hidden Markov Models

Phylogenetic hidden Markov models, or phylo-HMMs, are probabilistic models that consider not only the way substitutions occur through evolutionary history at each site of a genome but also the way this process changes from one site to the next. By treating molecular evolution as a combination of two Markov processes, one that operates in the dimension of space (along a genome) and one that operates in the dimension of time (along the branches of a phylogenetic tree), these models allow aspects of both sequence structure and sequence evolution to be captured. Moreover, as we will discuss, they permit key computations to be performed exactly and efficiently. Phylo-HMMs allow evolutionary information to be brought to bear on a wide variety of problems of sequence "segmentation," such as gene prediction and the identification of conserved elements. Phylo-HMMs were first proposed as a way of improving phylogenetic models that allow for variation among sites in the rate of substitution [9, 52]. Soon afterward, they were adapted for the problem of secondary structure prediction [11, 47], and some time later for the detection of recombination events [20]. Recently there has been a revival of interest in these models [41, 42, 43, 44, 33], in connection with an explosion in the availability of comparative sequence data, and an accompanying surge of interest in comparative methods for the detection of functional elements [5, 3, 24, 46, 6]. There has been particular interest in applying phylo-HMMs to a multispecies version of the ab initio gene prediction problem [41, 43, 33]. In this chapter, phylo-HMMs are introduced, and examples are presented illustrating how they can be used both to identify regions of interest in multiply aligned sequences and to improve the goodness of fit of ordinary phylogenetic models.
In addition, we discuss how hidden Markov models (HMMs), phylogenetic models, and phylo-HMMs all can be considered special cases of general "graphical models" and how the algorithms that are used with these models can be considered special cases of more general algorithms. This chapter is written at a tutorial level, suitable for readers who are familiar with phylogenetic models but have had limited exposure to other kinds of graphical models.
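As a rough illustration of how the two Markov processes combine, the following minimal sketch computes the log-likelihood of a small alignment under a two-state phylo-HMM. It is not the chapter's implementation. It assumes a Jukes-Cantor substitution model, a fixed three-species tree, and two hidden states that differ only in a rate multiplier. The function names, the tree encoding, the toy alignment, and all parameter values are hypothetical choices made for this example. Each column's emission probability comes from Felsenstein's pruning algorithm (the process in time), and the columns are chained with the scaled forward algorithm (the process in space).

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability that base a becomes base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def partials(node, column, rate):
    """Felsenstein pruning: P(observed leaves below node | base at node), per base."""
    if isinstance(node, str):              # a leaf is a species name
        if column[node] not in BASES:      # gaps are treated as missing data (assumption)
            return [1.0] * 4
        return [1.0 if column[node] == b else 0.0 for b in BASES]
    result = [1.0] * 4
    for child, length in node:             # an internal node is a tuple of (child, branch length)
        below = partials(child, column, rate)
        for i, a in enumerate(BASES):
            result[i] *= sum(jc_prob(a, b, rate * length) * below[j]
                             for j, b in enumerate(BASES))
    return result

def column_likelihood(tree, column, rate):
    """Likelihood of one alignment column, with a uniform base distribution at the root."""
    return sum(0.25 * p for p in partials(tree, column, rate))

def forward_loglik(tree, alignment, rates, init, trans):
    """Scaled forward algorithm over columns; each state emits via its own rate."""
    n = len(rates)
    loglik, prev = 0.0, None
    for column in alignment:
        emit = [column_likelihood(tree, column, r) for r in rates]
        if prev is None:
            cur = [init[k] * emit[k] for k in range(n)]
        else:
            cur = [emit[k] * sum(prev[j] * trans[j][k] for j in range(n))
                   for k in range(n)]
        scale = sum(cur)                   # rescale to avoid numerical underflow
        loglik += math.log(scale)
        prev = [c / scale for c in cur]
    return loglik

# Hypothetical toy inputs: ((human, mouse), chicken) with made-up branch lengths.
tree = (((("human", 0.1), ("mouse", 0.3)), 0.1), ("chicken", 0.5))
alignment = [
    {"human": "A", "mouse": "A", "chicken": "A"},
    {"human": "C", "mouse": "C", "chicken": "C"},
    {"human": "G", "mouse": "T", "chicken": "A"},
    {"human": "T", "mouse": "-", "chicken": "C"},
]
rates = [0.3, 1.0]                         # state 0 "conserved", state 1 "neutral"
init = [0.5, 0.5]
trans = [[0.9, 0.1], [0.1, 0.9]]

loglik = forward_loglik(tree, alignment, rates, init, trans)
print(loglik)
```

Giving the first state a smaller rate multiplier makes it favor columns with few substitutions, which is the intuition behind using such a model to segment an alignment into conserved and non-conserved regions.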

[1] David Haussler, et al. Combining phylogenetic and hidden Markov models in biosequence analysis, 2003, RECOMB '03.

[2] W. Freeman, et al. Bethe free energy, Kikuchi approximations, and belief propagation algorithms, 2001.

[3] W. Miller, et al. Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions, 1999, Nucleic acids research.

[4] Jon D. McAuliffe, et al. Phylogenetic Shadowing of Primate Sequences to Find Functional Regions of the Human Genome, 2003, Science.

[5] Judea Pearl, et al. Probabilistic reasoning in intelligent systems - networks of plausible inference, 1991, Morgan Kaufmann series in representation and reasoning.

[6] John B. Shoven, et al. I, Edinburgh Medical and Surgical Journal.

[7] J. L. Jensen, et al. Probabilistic models of DNA sequence evolution with context dependent rates of substitution, 2000, Advances in Applied Probability.

[8] P. Liò, et al. Models of molecular evolution and phylogeny, 1998, Genome research.

[9] Lisa M. D'Souza, et al. Genome sequence of the Brown Norway rat yields insights into mammalian evolution, 2004, Nature.

[10] J. Felsenstein. Evolutionary trees from DNA sequences: A maximum likelihood approach, 2005, Journal of Molecular Evolution.

[11] D. Haussler, et al. Phylogenetic estimation of context-dependent substitution rates by maximum likelihood, 2003, Molecular biology and evolution.

[12] Michael I. Jordan, et al. Graphical models: Probabilistic inference, 2002.

[13] Ian Holmes, et al. Evolutionary HMMs: a Bayesian approach to multiple alignment, 2001, Bioinform.

[14] Richard Durbin, et al. Comparative ab initio prediction of gene structures using pair HMMs, 2002, Bioinform.

[15] Michael A. Arbib, et al. The handbook of brain theory and neural networks, 1995, A Bradford book.

[16] L. Pachter, et al. SLAM: cross-species gene finding and alignment with a generalized pair hidden Markov model, 2003, Genome research.

[17] B. Birren, et al. Sequencing and comparison of yeast species to identify genes and regulatory elements, 2003, Nature.

[18] H. Kishino, et al. Dating of the human-ape splitting by a molecular clock of mitochondrial DNA, 2005, Journal of Molecular Evolution.

[19] Sean R. Eddy, et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, 1998.

[20] Dirk Husmeier, et al. Detection of Recombination in DNA Multiple Alignments with Hidden Markov Models, 2002, J. Comput. Biol.

[21] Michael I. Jordan, et al. Loopy Belief Propagation for Approximate Inference: An Empirical Study, 1999, UAI.

[22] David Heckerman, et al. A Tutorial on Learning with Bayesian Networks, 1998, Learning in Graphical Models.

[23] Tal Pupko, et al. A structural EM algorithm for phylogenetic inference, 2001, J. Comput. Biol.

[24] J. Felsenstein, et al. A Hidden Markov Model approach to variation among sites in rate of evolution, 1996, Molecular biology and evolution.

[25] Tom H. Pringle, et al. The human genome browser at UCSC, 2002, Genome research.

[26] Bjarne Knudsen, et al. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history, 1999, Bioinform.

[27] N. Goldman, et al. A codon-based model of nucleotide substitution for protein-coding DNA sequences, 1994, Molecular biology and evolution.

[28] Jens Ledet Jensen, et al. Recursions for statistical multiple alignment, 2003, Proceedings of the National Academy of Sciences of the United States of America.

[29] G. Mitchison. A Probabilistic Treatment of Phylogeny and Sequence Alignment, 1999, Journal of Molecular Evolution.

[30] Robert Cowell, et al. Introduction to Inference for Bayesian Networks, 1998, Learning in Graphical Models.

[31] J. L. Jensen, et al. A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames, 2001, Molecular biology and evolution.

[32] J. W. Thomas, et al. Comparative analyses of multi-species sequences from targeted genomic regions, 2003, Nature.

[33] Richard A. Goldstein, et al. Probabilistic reconstruction of ancestral protein sequences, 1996, Journal of Molecular Evolution.

[34] I. Holmes, et al. Using guide trees to construct multiple-sequence evolutionary HMMs, 2003, ISMB.

[35] S. T. Hess, et al. Wide variations in neighbor-dependent substitution rates, 1994, Journal of molecular biology.

[36] S. Geman, et al. Dynamic programming, tree-width and computation on graphical models, 2002.

[37] David C. Jones, et al. Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses, 1996, Journal of molecular biology.

[38] Pietro Liò, et al. PASSML: combining evolutionary inference and protein secondary structure prediction, 1998, Bioinform.

[39] Christopher B. Burge, et al. DNA Sequence Evolution with Neighbor-Dependent Mutation, 2003, J. Comput. Biol.

[40] D. Husmeier, et al. Detecting recombination in 4-taxa DNA sequence alignments with Bayesian hidden Markov models and Markov chain Monte Carlo, 2003, Molecular biology and evolution.

[41] Colin N. Dewey, et al. Initial sequencing and comparative analysis of the mouse genome, 2002.

[42] C. Wiuf, et al. A codon-based model designed to describe lentiviral evolution, 1998, Molecular biology and evolution.

[43] Sean R. Eddy, et al. What is dynamic programming?, 2004, Nature Biotechnology.

[44] David C. Jones, et al. Combining protein evolution and secondary structure, 1996, Molecular biology and evolution.

[45] Lior Pachter, et al. Multiple-sequence functional annotation and the generalized hidden Markov phylogeny, 2004, Bioinform.

[46] P. Liò, et al. Molecular phylogenetics: state-of-the-art methods for looking into the past, 2001, Trends in genetics : TIG.

[47] Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome, 2002, Nature.

[48] Marina Alexandersson. Cross-species Gene Finding and Alignment with a Generalized Pair Hidden Markov Model, 2003.

[49] Nebojsa Jojic, et al. Efficient approximations for learning phylogenetic HMM models from data, 2004, ISMB/ECCB.

[50] D. Haussler, et al. The share of human genomic DNA under selection estimated from human-mouse genomic alignments, 2003, Cold Spring Harbor symposia on quantitative biology.

[51] Christopher B. Burge, et al. DNA sequence evolution with neighbor-dependent mutation, 2001, RECOMB '02.

[52] Martin J. Wainwright, et al. Tree-based reparameterization framework for analysis of sum-product and related algorithms, 2003, IEEE Trans. Inf. Theory.

[53] Ziheng Yang. Estimating the pattern of nucleotide substitution, 1994, Journal of Molecular Evolution.

[54] D. Haussler, et al. Identification and Characterization of Multi-Species Conserved Sequences, 2022.

[55] Michael I. Jordan. Learning in Graphical Models, 1999, NATO ASI Series.

[56] Michael I. Jordan. Graphical Models, 1998.

[57] David Haussler, et al. Computational identification of evolutionarily conserved exons, 2004, RECOMB.

[58] Gráinne McGuire, et al. A Bayesian Model for Detecting Past Recombination Events in DNA Multiple Alignments, 2000, J. Comput. Biol.

[59] Jakob Skou Pedersen, et al. Gene finding with a hidden Markov model of genome structure and evolution, 2003, Bioinform.

[60] Z. Yang, et al. A space-time process model for the evolution of DNA sequences, 1995, Genetics.