Maximum Discrimination Hidden Markov Models of Sequence Consensus

We introduce a maximum discrimination method for building hidden Markov models (HMMs) of protein or nucleic acid primary sequence consensus. The method compensates for biased representation in sequence data sets, superseding the need for sequence weighting methods. Maximum discrimination HMMs are more sensitive for detecting distant sequence homologs than various other HMM methods or BLAST when tested on globin and protein kinase catalytic domain sequences.
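The abstract does not spell out the training procedure, so the following is only a toy sketch of the general idea: weight each training sequence by how poorly the current model discriminates it from a background model, then re-estimate the model. Everything here is an assumption made for illustration, not the authors' algorithm: the ungapped position-specific model used in place of a full HMM, the weight rule `1 - P(model | sequence)`, the uniform background, the pseudocount, the damping factor, and the names `estimate_profile`, `posterior`, and `discriminative_weights`.

```python
import math

ALPHABET = "ACGT"

def estimate_profile(seqs, weights, pseudocount=1.0):
    """Weighted per-column residue frequencies (ungapped stand-in for an HMM)."""
    profile = []
    for pos in range(len(seqs[0])):
        counts = {a: pseudocount for a in ALPHABET}
        for seq, w in zip(seqs, weights):
            counts[seq[pos]] += w
        total = sum(counts.values())
        profile.append({a: c / total for a, c in counts.items()})
    return profile

def posterior(seq, profile, prior=0.5, background=0.25):
    """P(model | seq) against a uniform background, from the log-odds score."""
    score = sum(math.log(profile[i][ch] / background) for i, ch in enumerate(seq))
    score += math.log(prior / (1.0 - prior))
    return 1.0 / (1.0 + math.exp(-score))

def discriminative_weights(seqs, iterations=50, damping=0.5):
    """Iteratively reweight sequences by 1 - P(model | seq) (assumed rule).

    Weights are rescaled to sum to len(seqs) and damped so the toy
    iteration settles instead of oscillating.
    """
    n = len(seqs)
    weights = [1.0] * n
    for _ in range(iterations):
        profile = estimate_profile(seqs, weights)
        raw = [1.0 - posterior(s, profile) for s in seqs]
        scale = n / sum(raw)
        weights = [damping * w + (1.0 - damping) * r * scale
                   for w, r in zip(weights, raw)]
    return weights

# Five identical sequences plus one divergent relative (made-up data).
seqs = ["ACGTACGT"] * 5 + ["ACGTTCGA"]
weights = discriminative_weights(seqs)
for s, w in zip(seqs, weights):
    print(s, round(w, 3))
```

In this toy run the divergent sequence ends up with a larger weight than any of the five redundant copies, which is the kind of compensation for biased representation the abstract describes, obtained here without a separate sequence weighting step.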
