An expectation maximization algorithm for training hidden substitution models.

We derive an expectation maximization algorithm for maximum-likelihood training of substitution rate matrices from multiple sequence alignments. The algorithm can be used to train hidden substitution models, where the structural context of a residue is treated as a hidden variable that can evolve over time. We used the algorithm to train hidden substitution matrices on protein alignments in the Pfam database. When the accuracy of multiple alignment algorithms is measured against BAliBASE (a database of structural reference alignments), our substitution matrices consistently outperform the PAM series, and the improvement increases steadily as up to four hidden site classes are added. We discuss several applications of this algorithm in bioinformatics.
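To make the idea of a hidden substitution model concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each state is a (hidden class, residue) pair, that a single rate matrix over these joint states governs both residue substitutions and class switches, and that only the residue is observed. The two-letter alphabet, the two site classes, all rate values, and the function names (`build_rate_matrix`, `expm`, `observed_probs`) are illustrative assumptions. The EM training step itself is not shown.

```python
# Toy hidden substitution model. States are (hidden class, residue) pairs.
# All rates and names here are illustrative assumptions, not trained values.

RESIDUES = ["A", "B"]   # toy two-letter alphabet (assumption)
CLASSES = 2             # number of hidden site classes (assumption)


def build_rate_matrix(within, between):
    """Joint rate matrix over (class, residue) states.

    within[c][i][j]: rate of residue i -> j inside class c (i != j).
    between[c][d]:   rate of switching class c -> d, residue unchanged.
    """
    n = len(RESIDUES)
    size = CLASSES * n
    Q = [[0.0] * size for _ in range(size)]
    for c in range(CLASSES):
        for i in range(n):
            s = c * n + i
            for j in range(n):
                if j != i:
                    Q[s][c * n + j] = within[c][i][j]
            for d in range(CLASSES):
                if d != c:
                    Q[s][d * n + i] = between[c][d]
            # Diagonal is still zero here, so this is minus the off-diagonal sum.
            Q[s][s] = -sum(Q[s])
    return Q


def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def expm(Q, t, squarings=10, terms=12):
    """exp(Q * t) by a truncated Taylor series with scaling and squaring."""
    n = len(Q)
    scale = t / (2 ** squarings)
    M = [[Q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in matmul(term, M)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = matmul(result, result)
    return result


def observed_probs(P, start):
    """Residue-to-residue probabilities with the hidden class summed out."""
    n = len(RESIDUES)
    obs = [[0.0] * n for _ in range(n)]
    for i in range(n):
        total = sum(start[c * n + i] for c in range(CLASSES))
        for c in range(CLASSES):
            weight = start[c * n + i] / total   # P(class c | residue i)
            for d in range(CLASSES):
                for j in range(n):
                    obs[i][j] += weight * P[c * n + i][d * n + j]
    return obs


# Class 0 substitutes quickly, class 1 slowly; classes switch at rate 0.2.
within = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.1], [0.1, 0.0]]]
between = [[0.0, 0.2], [0.2, 0.0]]
Q = build_rate_matrix(within, between)
P = expm(Q, 0.5)                       # joint transition probabilities at t = 0.5
obs = observed_probs(P, [0.25] * 4)    # uniform start distribution (assumption)
print(obs)
```

In this sketch the observed residue process is a mixture over hidden classes, which is why a single residue-only rate matrix cannot reproduce it in general. Training would then estimate the entries of `Q` from alignments.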