Statistical Significance of Threading Scores

We present a general method for assessing the significance of threading scores. The threading score of a protein sequence threaded onto a given structure should be compared with the distribution of threading scores obtained for random amino-acid sequences of the same length threaded onto the same structure; a small p-value indicates a significantly high score. We claim that, owing to general properties of protein contact maps, this reference distribution is a Weibull extreme value distribution whose parameters depend on the threading method, the structure, the length of the query, and the model used to simulate random sequences. These parameters can be estimated off-line from samples of simulated sequences at several sequence lengths. They can then be interpolated at the exact length of a query, which allows the p-value to be computed quickly.
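
The last two steps (interpolating the parameters at the query length, then computing the p-value) can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: the Weibull-type law is written in the generalized extreme value parameterization (location mu, scale sigma, shape xi < 0), the interpolation is linear in sequence length, and the table `FITTED` holds purely hypothetical parameter values standing in for the off-line estimates for one structure and one threading method.

```python
import math
from bisect import bisect_right

# Hypothetical off-line estimates for one structure and one threading method:
# sequence length -> (location mu, scale sigma, shape xi), with xi < 0 (Weibull type).
# These numbers are illustrative only; they are not results from the paper.
FITTED = {
    50: (10.0, 2.0, -0.20),
    100: (18.0, 3.0, -0.15),
    200: (30.0, 4.5, -0.10),
}


def interpolate_params(length):
    """Linearly interpolate (mu, sigma, xi) at the given query length (assumed scheme)."""
    lengths = sorted(FITTED)
    if not lengths[0] <= length <= lengths[-1]:
        raise ValueError("query length outside the range of fitted lengths")
    if length in FITTED:
        return FITTED[length]
    hi = bisect_right(lengths, length)       # index of the first fitted length > query
    lo_len, hi_len = lengths[hi - 1], lengths[hi]
    w = (length - lo_len) / (hi_len - lo_len)
    return tuple((1 - w) * a + w * b for a, b in zip(FITTED[lo_len], FITTED[hi_len]))


def p_value(score, length):
    """P(random score >= observed score) under the interpolated Weibull-type law."""
    mu, sigma, xi = interpolate_params(length)
    if xi >= 0 or sigma <= 0:
        raise ValueError("this sketch only handles the Weibull type (xi < 0, sigma > 0)")
    t = 1.0 + xi * (score - mu) / sigma
    if t <= 0.0:
        # The score is at or beyond the finite upper endpoint mu - sigma / xi.
        return 0.0
    # Survival function 1 - exp(-t ** (-1 / xi)), computed stably with expm1.
    return -math.expm1(-(t ** (-1.0 / xi)))


if __name__ == "__main__":
    # A query of length 75 with a hypothetical threading score of 20.
    print(p_value(20.0, 75))
```

Once the table is filled in, each query costs one table lookup and one exponential, which is what makes the p-value quick to compute. Whether linear interpolation is adequate between fitted lengths is a design choice that would need checking against simulated samples.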
