Markovian structures in biological sequence alignments

Abstract The alignment of multiple homologous biopolymer sequences is crucial in research on protein modeling and engineering, molecular evolution, and the prediction of both gene function and gene product structure. In this article we provide a coherent view of two recent models used for multiple sequence alignment, the hidden Markov model (HMM) and the block-based motif model, and use it to develop a set of new algorithms that have both the sensitivity of the block-based model and the flexibility of the HMM. In particular, we decompose the standard HMM into two components: the insertion component, which is captured by the so-called “propagation model,” and the deletion component, which is described by a deletion vector. Such a decomposition serves as a basis for rational compromise between biological specificity and model flexibility. Furthermore, we introduce a Bayesian model selection criterion that (in combination with the propagation model, a genetic algorithm, and other computational aspects) forms the cor...
