Score Distributions for Simultaneous Matching to Multiple Motifs

Several computer algorithms now exist for discovering multiple motifs (expressed as weight matrices) that characterize a family of protein sequences known to be homologous. This paper describes a method for performing similarity searches of protein sequence databases using such a group of motifs. By simultaneously using all the motifs that characterize a protein family, the sensitivity and specificity of the database search are increased. We define the p-value for a target sequence to be the probability of a random sequence of the same length scoring as well or better in comparison to all the motifs that characterize the family. (The p-value of a database search can be determined from this value and the size of the database.) We show that estimating the distribution of single motif scores by a Gaussian extreme value distribution is insufficiently accurate to provide a useful estimate of the p-value, but that this deficiency can be corrected by reestimating the parameters of the underlying Gaussian distribution from observed scores for comparison of a given motif and sequence database. These parameters are used to calculate a "reduced variate" which has a Gumbel limiting distribution. Multiple motif scores are combined to give a single p-value by using the sum of the reduced variates for the motif scores as the test statistic. We give a computationally efficient approximation to the distribution of the sum of independent Gumbel random variables and verify experimentally that it closely approximates the distribution of the test statistic. Experiments on pseudorandom sequences show that the approximated p-values are conservative, so the significance of high scores in database searches will not be overstated. Experiments with real protein sequences and motifs identified by the MEME algorithm show that determining an overall p-value based on the combination of multiple motifs gives significantly better database search results than using p-values of single motifs.

[1]  Michael S. Waterman,et al.  Approximations to Profile Score Distributions , 1994, J. Comput. Biol..

[2]  S F Altschul,et al.  Local alignment statistics. , 1996, Methods in enzymology.

[3]  A. Bairoch,et al.  The SWISS-PROT protein sequence data bank: current status. , 1994, Nucleic acids research.

[4]  R. R. Kinnison,et al.  Applied extreme value statistics , 1984 .

[5]  Peter Hall,et al.  Estimating probabilities for normal extremes , 1980, Advances in Applied Probability.

[6]  E. Myers,et al.  Basic local alignment search tool. , 1990, Journal of molecular biology.

[7]  William H. Press,et al.  Numerical recipes , 1990 .

[8]  Charles Elkan,et al.  Fitting a Mixture Model By Expectation Maximization To Discover Motifs In Biopolymer , 1994, ISMB.

[9]  W. Pearson Rapid and sensitive sequence comparison with FASTP and FASTA. , 1990, Methods in enzymology.

[10]  G. Stormo Consensus patterns in DNA. , 1990, Methods in enzymology.

[11]  Michael Gribskov,et al.  Use of Receiver Operating Characteristic (ROC) Analysis to Evaluate Sequence Matching , 1996, Comput. Chem..

[12]  S. Henikoff,et al.  Automated assembly of protein blocks for database searching. , 1991, Nucleic acids research.

[13]  Amos Bairoch,et al.  The PROSITE database, its status in 1995 , 1996, Nucleic Acids Res..

[14]  M S Waterman,et al.  Identification of common molecular subsequences. , 1981, Journal of molecular biology.

[15]  Jun S. Liu,et al.  Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment. , 1993, Science.

[16]  Shmuel Pietrokovski,et al.  The Blocks database--a system for protein classification , 1996, Nucleic Acids Res..

[17]  D. Haussler,et al.  Hidden Markov models in computational biology. Applications to protein modeling. , 1993, Journal of molecular biology.

[18]  M. Gribskov,et al.  [9] Profile analysis , 1990 .

[19]  Charles Elkan,et al.  Unsupervised learning of multiple motifs in biopolymers using expectation maximization , 1995, Mach. Learn..