Hidden Markov Models in Computational Biology: Applications to Protein Modeling (UCSC-CRL-93-32)

Hidden Markov Models (HMMs) are applied to the problems of statistical modeling, database searching and multiple sequence alignment of protein families and protein domains. These methods are demonstrated on the globin family, the protein kinase catalytic domain, and the EF-hand calcium binding motif. In each case the parameters of an HMM are estimated from a training set of unaligned sequences. After the HMM is built, it is used to obtain a multiple alignment of all the training sequences. It is also used to search the SWISS-PROT 22 database for other sequences that are members of the given protein family, or contain the given domain. The HMM produces multiple alignments of good quality that agree closely with the alignments produced by programs that incorporate three-dimensional structural information. When employed in discrimination tests (by examining how closely the sequences in a database fit the globin, kinase and EF-hand HMMs), the HMM is able to distinguish members of these families from non-members with a high degree of accuracy. Both the HMM and PROFILESEARCH (a technique used to search for relationships between a protein sequence and multiply aligned sequences) perform better in these tests than PROSITE (a dictionary of sites and patterns in proteins). The HMM appears to have a slight advantage over PROFILESEARCH in terms of lower rates of false negatives and false positives, even though the HMM is trained using only unaligned sequences, whereas PROFILESEARCH requires aligned training sequences. Our results suggest the presence of an EF-hand calcium binding motif in a highly conserved and evolutionarily preserved putative intracellular region of 155 residues in the α-1 subunit of L-type calcium channels, which play an important role in excitation-contraction coupling. This region has been suggested to contain the functional domains that are typical or essential for all L-type calcium channels regardless of whether they couple to ryanodine receptors, conduct ions or both.
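
The idea of testing how closely a sequence fits an HMM can be illustrated with a minimal sketch. The code below is not the architecture or scoring procedure used in this work: it assumes a hypothetical two-state HMM over a four-letter toy alphabet, computes log P(sequence | model) with the standard forward algorithm, and compares it with a uniform background model as a simple log-odds score. All names and probabilities (`TOY_START`, `TOY_TRANS`, `TOY_EMIT`, `log_odds`) are invented for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == float("-inf"):
        return m
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, start, trans, emit):
    """Return log P(seq | HMM) via the forward algorithm in log space.

    start[s]    : probability of starting in state s
    trans[s][t] : probability of moving from state s to state t
    emit[s][x]  : probability that state s emits symbol x
    All probabilities are assumed to be strictly positive.
    """
    states = list(start)
    # Initialisation: start in each state and emit the first symbol.
    alpha = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    # Recursion: sum over all predecessor states for each later symbol.
    for x in seq[1:]:
        alpha = {
            t: math.log(emit[t][x])
            + logsumexp([alpha[s] + math.log(trans[s][t]) for s in states])
            for t in states
        }
    # Termination: sum over the state occupied at the last position.
    return logsumexp(list(alpha.values()))


# Hypothetical toy model (illustrative values only).
ALPHABET = "ADGK"
TOY_START = {"conserved": 0.5, "variable": 0.5}
TOY_TRANS = {
    "conserved": {"conserved": 0.8, "variable": 0.2},
    "variable": {"conserved": 0.2, "variable": 0.8},
}
TOY_EMIT = {
    "conserved": {"A": 0.1, "D": 0.1, "G": 0.7, "K": 0.1},
    "variable": {"A": 0.25, "D": 0.25, "G": 0.25, "K": 0.25},
}


def log_odds(seq):
    """Log-odds of the toy HMM against a uniform background model."""
    background = len(seq) * math.log(1.0 / len(ALPHABET))
    return forward_log_likelihood(seq, TOY_START, TOY_TRANS, TOY_EMIT) - background


if __name__ == "__main__":
    for s in ("GGGG", "AKDA"):
        print(s, round(log_odds(s), 3))
```

In this sketch a sequence with a positive log-odds score fits the toy model better than the background does. A real database search would need a model trained on the family of interest and a threshold chosen to balance false positives against false negatives.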
