Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology

We present a method for condensing the information in multiple alignments of proteins into a mixture of Dirichlet densities over amino acid distributions. Dirichlet mixture densities are designed to be combined with observed amino acid frequencies to form estimates of expected amino acid probabilities at each position in a profile, hidden Markov model or other statistical model. These estimates give a statistical model greater generalization capacity, so that remotely related family members can be more reliably recognized by the model. This paper corrects the previously published formula for estimating these expected probabilities, and contains complete derivations of the Dirichlet mixture formulas, methods for optimizing the mixtures to match particular databases, and suggestions for efficient implementation.
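As an illustration of how a Dirichlet mixture might be combined with observed counts, the following is a minimal sketch, not the paper's implementation. It assumes the standard posterior-mean estimator for a Dirichlet mixture prior: each component is weighted by its posterior probability given the counts, and the per-component estimates (n_i + alpha_i) / (|n| + |alpha|) are averaged under those weights. The function names, the toy three-letter alphabet and the parameter values are hypothetical; a real mixture would cover the 20 amino acids and be fitted to a database of alignments.

```python
from math import exp, lgamma, log


def log_beta(v):
    """Log of the multivariate Beta function: sum(lgamma(v_i)) - lgamma(sum(v))."""
    return sum(lgamma(x) for x in v) - lgamma(sum(v))


def component_posteriors(counts, mix_coeffs, alphas):
    """Posterior probability of each mixture component given the observed counts.

    Uses weight_j proportional to q_j * B(alpha_j + n) / B(alpha_j); the
    multinomial coefficient is the same for every component and cancels.
    """
    log_w = [
        log(q) + log_beta([a + n for a, n in zip(alpha, counts)]) - log_beta(alpha)
        for q, alpha in zip(mix_coeffs, alphas)
    ]
    # Subtract the maximum before exponentiating to avoid underflow.
    m = max(log_w)
    w = [exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]


def posterior_mean(counts, mix_coeffs, alphas):
    """Estimated probability of each symbol at one column (assumed estimator)."""
    post = component_posteriors(counts, mix_coeffs, alphas)
    n_total = sum(counts)
    est = [0.0] * len(counts)
    for p_j, alpha in zip(post, alphas):
        denom = n_total + sum(alpha)
        for i, (n, a) in enumerate(zip(counts, alpha)):
            est[i] += p_j * (n + a) / denom
    return est


# Hypothetical two-component mixture over a toy three-letter alphabet.
toy_q = [0.6, 0.4]
toy_alphas = [[5.0, 0.5, 0.5], [0.5, 2.0, 2.0]]
toy_counts = [3, 0, 1]
toy_estimate = posterior_mean(toy_counts, toy_q, toy_alphas)
```

With no observations the estimate falls back to the mean of the mixture, and as counts accumulate it approaches the observed frequencies, which is the behaviour that lets a model built from few sequences still assign probability to residues it has not yet seen. Working with `lgamma` rather than the Gamma function directly is one way to keep the component weights numerically stable for large counts.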
