Weighting hidden Markov models for maximum discrimination

MOTIVATION Hidden Markov models can efficiently and automatically build statistical representations of related sequences. Unfortunately, training sets are frequently biased toward one subgroup of sequences, leading to an insufficiently general model. This work evaluates sequence weighting methods based on the maximum-discrimination idea. RESULTS One good method scales sequence weights by an exponential that ranges between 0.1 for the best scoring sequence and 1.0 for the worst. Experiments with a curated data set show that while training with one or two sequences performed worse than single-sequence Probabilistic Smith-Waterman, training with five or ten sequences reduced errors by 20% and 51%, respectively. This new version of the SAM HMM suite outperforms HMMer (17% reduction over PSW for 10 training sequences), Meta-MEME (28% reduction), and unweighted SAM (31% reduction). AVAILABILITY A WWW server, as well as information on obtaining the Sequence Alignment and Modeling (SAM) software suite and additional data from this work, can be found at http://www.cse.ucse. edu/research/compbio/sam.html

[1]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[2]  Amos Bairoch,et al.  A Generalized Profile Syntax for Biomolecular Sequence Motifs and its Function in Automatic Sequence Interpretation , 1994, ISMB.

[3]  Philipp Bucher,et al.  A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System , 1996, ISMB.

[4]  W. Pearson Comparison of methods for searching protein sequence databases , 1995, Protein science : a publication of the Protein Society.

[5]  William Noble Grundy,et al.  Meta-MEME: motif-based hidden Markov models of protein families , 1997, Comput. Appl. Biosci..

[6]  C Sander,et al.  Predicting protein structure using hidden Markov models , 1997, Proteins.

[7]  Chris Smith,et al.  Parameterization Studies for the SAM and HMMER Methods of Hidden Markov Model Generation , 1996, ISMB.

[8]  Aleksandar Milosavljevic,et al.  Discovering simple DNA sequences by the algorithmic significance method , 1993, Comput. Appl. Biosci..

[9]  Pankaj Agarwal,et al.  Comparative accuracy of methods for protein sequence similarity search , 1998, Bioinform..

[10]  Stephen M. Mount,et al.  Sequence similarity , 2020, Nature.

[11]  William Noble Grundy,et al.  Family-based homology detection via pairwise sequence comparison , 1998, RECOMB '98.

[12]  S. Altschul Amino acid substitution matrices from an information theoretic perspective , 1991, Journal of Molecular Biology.

[13]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[14]  William R. Pearson,et al.  Identifying distantly related protein sequences. , 1997, Computer applications in the biosciences : CABIOS.

[15]  Sean R. Eddy,et al.  Multiple Alignment Using Hidden Markov Models , 1995, ISMB.

[16]  Kevin Karplus,et al.  A Flexible Motif Search Technique Based on Generalized Profiles , 1996, Comput. Chem..

[17]  S. Henikoff,et al.  Amino acid substitution matrices. , 2000, Advances in protein chemistry.

[18]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[19]  D. Lipman,et al.  Improved tools for biological sequence comparison. , 1988, Proceedings of the National Academy of Sciences of the United States of America.

[20]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[21]  S. Henikoff,et al.  Position-based sequence weights. , 1994, Journal of molecular biology.

[22]  D. Haussler,et al.  Protein modeling using hidden Markov models: analysis of globins , 1993, [1993] Proceedings of the Twenty-sixth Hawaii International Conference on System Sciences.

[23]  S F Altschul,et al.  Local alignment statistics. , 1996, Methods in enzymology.

[24]  Richard Hughey,et al.  Reduced space hidden Markov model training , 1998, Bioinform..

[25]  Richard Hughey,et al.  Scoring hidden Markov models , 1997, Comput. Appl. Biosci..

[26]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[27]  S. Eddy Hidden Markov models. , 1996, Current opinion in structural biology.

[28]  K. Karplus REGULARIZERS FOR ESTIMATING DISTRIBUTIONS OF AMINO ACIDS FROM SMALL SAMPLES , 1995 .

[29]  M. A. McClure,et al.  Hidden Markov models of biological primary sequence information. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[30]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[31]  Sean R. Eddy,et al.  Maximum Discrimination Hidden Markov Models of Sequence Consensus , 1995, J. Comput. Biol..