Combining phylogenetic and hidden Markov models in biosequence analysis

A few models have appeared in recent years that consider not only the way substitutions occur through evolutionary history at each site of a genome, but also the way the process changes from one site to the next. These models combine phylogenetic models of molecular evolution, which apply to individual sites, and hidden Markov models, which allow for changes from site to site. Besides improving the realism of ordinary phylogenetic models, they are potentially very powerful tools for inference and prediction---for gene finding, for example, or prediction of secondary structure. In this paper, we review progress on combined phylogenetic and hidden Markov models and present some extensions to previous work. Our main result is a simple and efficient method for accommodating higher-order states in the HMM, which allows for context-sensitive models of substitution---that is, models that consider the effects of neighboring bases on the pattern of substitution. We present experimental results indicating that higher-order states, autocorrelated rates, and multiple functional categories all lead to significant improvements in the fit of a combined phylogenetic and hidden Markov model, with the effect of higher-order states being particularly pronounced.

[1]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[2]  Jakob Skou Pedersen,et al.  Gene finding with a hidden Markov model of genome structure and evolution , 2003, Bioinform..

[3]  P. Lio’,et al.  Molecular phylogenetics: state-of-the-art methods for looking into the past. , 2001, Trends in genetics : TIG.

[4]  Mouse Genome Sequencing Consortium Initial sequencing and comparative analysis of the mouse genome , 2002, Nature.

[5]  B. Rannala,et al.  Phylogenetic methods come of age: testing hypotheses in an evolutionary context. , 1997, Science.

[6]  Z. Yang,et al.  A space-time process model for the evolution of DNA sequences. , 1995, Genetics.

[7]  S. Muse Evolutionary analyses of DNA sequences subject to constraints of secondary structure. , 1995, Genetics.

[8]  N. Goldman,et al.  Comparison of models for nucleotide substitution used in maximum-likelihood phylogenetic estimation. , 1994, Molecular biology and evolution.

[9]  J. Kingman A FIRST COURSE IN STOCHASTIC PROCESSES , 1967 .

[10]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[11]  E. Tillier,et al.  Neighbor Joining and Maximum Likelihood with RNA Sequences: Addressing the Interdependence of Sites , 1995 .

[12]  A. Rzhetsky Estimating substitution rates in ribosomal RNA genes. , 1995, Genetics.

[13]  D. Haussler,et al.  A hidden Markov model that finds genes in E. coli DNA. , 1994, Nucleic acids research.

[14]  S. Karlin,et al.  Prediction of complete gene structures in human genomic DNA. , 1997, Journal of molecular biology.

[15]  J. W. Thomas,et al.  Comparative analyses of multi-species sequences from targeted genomic regions , 2003, Nature.

[16]  R. Gibbs,et al.  PipMaker--a web server for aligning two genomic DNA sequences. , 2000, Genome research.

[17]  G. Churchill Stochastic models for heterogeneous DNA sequences. , 1989, Bulletin of mathematical biology.

[18]  Pietro Liò,et al.  PASSML: combining evolutionary inference and protein secondary structure prediction , 1998, Bioinform..

[19]  David C. Jones,et al.  Combining protein evolution and secondary structure. , 1996, Molecular biology and evolution.

[20]  Lior Pachter,et al.  Multiple-sequence functional annotation and the generalized hidden Markov phylogeny , 2004, Bioinform..

[21]  Simon Whelan,et al.  Distributions of statistics used for the comparison of models of sequence evolution in phylogenetics , 1999 .

[22]  J. L. Jensen,et al.  Probabilistic models of DNA sequence evolution with context dependent rates of substitution , 2000, Advances in Applied Probability.

[23]  S. Muse,et al.  A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. , 1994, Molecular biology and evolution.

[24]  P. Lio’,et al.  Models of molecular evolution and phylogeny. , 1998, Genome research.

[25]  J. Neyman MOLECULAR STUDIES OF EVOLUTION: A SOURCE OF NOVEL STATISTICAL PROBLEMS* , 1971 .

[26]  M. Clegg,et al.  The Influence of Specific Neighboring Bases on Substitution Bias in Noncoding Regions of the Plant Chloroplast Genome , 1997, Journal of Molecular Evolution.

[27]  A. Krogh Two methods for improving performance of an HMM application for gene finding , 1997 .

[28]  David Haussler,et al.  A Generalized Hidden Markov Model for the Recognition of Human Genes in DNA , 1996, ISMB.

[29]  David Haussler,et al.  Computational identification of evolutionarily conserved exons , 2004, RECOMB.

[30]  W. Freeman,et al.  Bethe free energy, Kikuchi approximations, and belief propagation algorithms , 2001 .

[31]  A. Iserles Numerical recipes in C—the art of scientific computing , by W. H. Press, B. P. Flannery, S. A. Teukolsky and W. T. Vetterling. Pp 735. £27·50. 1988. ISBN 0-521-35465-X (Cambridge University Press) , 1989, The Mathematical Gazette.

[32]  Ziheng Yang,et al.  PAML: a program package for phylogenetic analysis by maximum likelihood , 1997, Comput. Appl. Biosci..

[33]  F. A. Seiler,et al.  Numerical Recipes in C: The Art of Scientific Computing , 1989 .

[34]  K. Lange Reconstruction of Evolutionary Trees , 1997 .

[35]  N. Goldman,et al.  A codon-based model of nucleotide substitution for protein-coding DNA sequences. , 1994, Molecular biology and evolution.

[36]  W. Murphy,et al.  Resolution of the Early Placental Mammal Radiation Using Bayesian Phylogenetics , 2001, Science.

[37]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[38]  Joseph Felsenstein,et al.  Maximum Likelihood and Minimum-Steps Methods for Estimating Evolutionary Trees from Data on Discrete Characters , 1973 .

[39]  Mei Li,et al.  MultiPipMaker and supporting tools: alignments and analysis of multiple genomic DNA sequences , 2003, Nucleic Acids Res..

[40]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[41]  C. Wiuf,et al.  A codon-based model designed to describe lentiviral evolution. , 1998, Molecular biology and evolution.

[42]  J. L. Jensen,et al.  A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping reading frames. , 2001, Molecular biology and evolution.

[43]  P. Sharp,et al.  Chromosomal location effects on gene sequence evolution in mammals , 1999, Current Biology.

[44]  L. Hurst,et al.  The proteins of linked genes evolve at similar rates , 2000, Nature.

[45]  William H. Press,et al.  Numerical recipes in C++: the art of scientific computing, 2nd Edition (C++ ed., print. is corrected to software version 2.10) , 1994 .

[46]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[47]  Samuel Karlin,et al.  A First Course on Stochastic Processes , 1968 .

[48]  S. Hess,et al.  The influence of nearest neighbors on the rate and pattern of spontaneous point mutations , 1992, Journal of Molecular Evolution.

[49]  D. Haussler,et al.  Phylogenetic estimation of context-dependent substitution rates by maximum likelihood. , 2003, Molecular biology and evolution.

[50]  Ziheng Yang Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: Approximate methods , 1994, Journal of Molecular Evolution.

[51]  Ian Holmes,et al.  Evolutionary HMMs: a Bayesian approach to multiple alignment , 2001, Bioinform..

[52]  G. Mitchison A Probabilistic Treatment of Phylogeny and Sequence Alignment , 1999, Journal of Molecular Evolution.

[53]  David C. Jones,et al.  Using evolutionary trees in protein secondary structure prediction and other comparative sequence analyses. , 1996, Journal of molecular biology.

[54]  P. Sharp,et al.  Evidence for a high frequency of simultaneous double-nucleotide substitutions. , 2000, Science.

[55]  William H. Press,et al.  Book-Review - Numerical Recipes in Pascal - the Art of Scientific Computing , 1989 .

[56]  H. Kishino,et al.  Dating of the human-ape splitting by a molecular clock of mitochondrial DNA , 2005, Journal of Molecular Evolution.

[57]  Dirk Husmeier,et al.  Detection of Recombination in DNA Multiple Alignments with Hidden Markov Models , 2002, J. Comput. Biol..

[58]  Michael I. Jordan,et al.  Loopy Belief Propagation for Approximate Inference: An Empirical Study , 1999, UAI.

[59]  A. von Haeseler,et al.  A stochastic model for the evolution of autocorrelated DNA sequences. , 1994, Molecular phylogenetics and evolution.

[60]  Christopher B. Burge,et al.  DNA sequence evolution with neighbor-dependent mutation , 2001, RECOMB '02.

[61]  Z. Yang,et al.  Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. , 1993, Molecular biology and evolution.

[62]  J. Felsenstein,et al.  A Hidden Markov Model approach to variation among sites in rate of evolution. , 1996, Molecular biology and evolution.

[63]  J. Felsenstein Evolutionary trees from DNA sequences: A maximum likelihood approach , 2005, Journal of Molecular Evolution.