Transition Priors for Protein Hidden Markov Models: An Empirical Study towards Maximum Discrimination

Insertions and deletions in a profile hidden Markov model (HMM) are modeled by transition probabilities between insert, delete and match states. These are estimated by combining observed data and prior probabilities. The transition prior probabilities can be defined either ad hoc or by maximum likelihood (ML) estimation. We show that the choice of transition prior greatly affects the HMM's ability to discriminate between true and false hits. HMM discrimination was measured using the HMMER 2.2 package applied to 373 families from Pfam. We measured the discrimination between true members and noise sequences employing various ML transition priors and also systematically scanned the parameter space of ad hoc transition priors. Our results indicate that ML priors produce far from optimal discrimination, and we present an empirically derived prior that considerably decreases the number of misclassifications compared to ML. Most of the difference stems from the probabilities for exiting a delete state. The ML prior, which is unaware of noise sequences, estimates a delete-to-delete probability that is relatively high and does not penalize noise sequences enough for optimal discrimination.

[1]  E. Lindahl,et al.  Identification of related proteins on family, superfamily and fold level. , 2000, Journal of molecular biology.

[2]  Anders Krogh,et al.  Maximum Entropy Weighting of Aligned Sequences of Proteins or DNA , 1995, ISMB.

[3]  David Haussler,et al.  Using Dirichlet Mixture Priors to Derive Hidden Markov Models for Protein Families , 1993, ISMB.

[4]  Sean R. Eddy,et al.  Profile hidden Markov models , 1998, Bioinform..

[5]  Jorja G. Henikoff,et al.  Using substitution probabilities to improve position-specific scoring matrices , 1996, Comput. Appl. Biosci..

[6]  David Haussler,et al.  Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology , 1996, Comput. Appl. Biosci..

[7]  Richard Hughey,et al.  Hidden Markov models for detecting remote protein homologies , 1998, Bioinform..

[8]  C. Chothia,et al.  Volume changes in protein evolution. , 1994, Journal of molecular biology.

[9]  Sean R. Eddy,et al.  Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids , 1998 .

[10]  M. Madera,et al.  A comparison of profile hidden Markov model procedures for remote homology detection. , 2002, Nucleic acids research.

[11]  S. Henikoff,et al.  Position-based sequence weights. , 1994, Journal of molecular biology.

[12]  Markus Wistrand,et al.  Improving profile HMM discrimination by adapting transition probabilities. , 2004, Journal of molecular biology.

[13]  Peer Bork,et al.  SMART, a simple modular architecture research tool , 1998 .

[14]  Richard Hughey,et al.  Weighting hidden Markov models for maximum discrimination , 1998, Bioinform..

[15]  Fan Yang,et al.  TIGRFAMs: a protein family resource for the functional identification of proteins , 2001, Nucleic Acids Res..

[16]  Pierre Baldi,et al.  Bioinformatics - the machine learning approach (2. ed.) , 2000 .

[17]  D. Haussler,et al.  Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. , 1998, Journal of molecular biology.

[18]  S. Altschul,et al.  Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. , 1994, Proceedings of the National Academy of Sciences of the United States of America.

[19]  C. Chothia,et al.  Assignment of homology to genome sequences using a library of hidden Markov models that represent all proteins of known structure. , 2001, Journal of molecular biology.

[20]  Anders Krogh,et al.  Hidden Markov models for sequence analysis: extension and analysis of the basic method , 1996, Comput. Appl. Biosci..

[21]  Sean R. Eddy,et al.  Maximum Discrimination Hidden Markov Models of Sequence Consensus , 1995, J. Comput. Biol..

[22]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.