Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families

A Bayesian method for estimating the amino acid distributions in the states of a hidden Markov model (HMM) for a protein family or the columns of a multiple alignment of that family is introduced. This method uses Dirichlet mixture densities as priors over amino acid distributions. These mixture densities are determined from examination of previously constructed HMMs or multiple alignments. It is shown that this Bayesian method can improve the quality of HMMs produced from small training sets. Specific experiments on the EF-hand motif are reported, for which these priors are shown to produce HMMs with higher likelihood on unseen data, and fewer false positives and false negatives in a database search task.
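The abstract does not state the estimator, so the following is only a minimal sketch of how a Dirichlet mixture prior can turn observed counts into an amino acid distribution. It uses the standard posterior-mean formula for a mixture of Dirichlet densities: each component is weighted by its posterior probability given the counts, and the components' individual posterior means are averaged. The function names, the three-letter toy alphabet, and the component parameters `TOY_WEIGHTS` and `TOY_ALPHAS` are assumptions made for illustration, not values from the paper; a real protein model would use 20-dimensional components estimated from existing alignments or HMMs.

```python
from math import lgamma, log, exp

def log_marginal(counts, alpha):
    """Log-probability of counts under one Dirichlet component.

    The multinomial coefficient is omitted because it is the same for
    every component and cancels when the weights are normalized.
    """
    n, a = sum(counts), sum(alpha)
    total = lgamma(a) - lgamma(n + a)
    for c, al in zip(counts, alpha):
        total += lgamma(c + al) - lgamma(al)
    return total

def posterior_mean(counts, weights, alphas):
    """Return (component posteriors, estimated distribution)."""
    # Posterior probability of each component given the observed counts.
    logs = [log(q) + log_marginal(counts, a) for q, a in zip(weights, alphas)]
    top = max(logs)  # subtract the max for numerical stability
    unnorm = [exp(v - top) for v in logs]
    z = sum(unnorm)
    resp = [u / z for u in unnorm]

    # Average the per-component posterior means, weighted by resp.
    n = sum(counts)
    est = [0.0] * len(counts)
    for r, alpha in zip(resp, alphas):
        denom = n + sum(alpha)
        for i, (c, al) in enumerate(zip(counts, alpha)):
            est[i] += r * (c + al) / denom
    return resp, est

# Hypothetical two-component prior over a three-letter toy alphabet.
TOY_WEIGHTS = [0.5, 0.5]
TOY_ALPHAS = [[5.0, 0.5, 0.5],   # component favoring the first letter
              [0.5, 0.5, 5.0]]   # component favoring the third letter

if __name__ == "__main__":
    resp, est = posterior_mean([3, 0, 0], TOY_WEIGHTS, TOY_ALPHAS)
    print("component posteriors:", resp)
    print("estimated distribution:", est)
```

With only a few observations, the estimate is pulled toward whichever component best explains the counts, which is the mechanism by which such a prior can help when the training set is small. With no observations at all, the estimate reduces to the weighted average of the component means.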
