Protein secondary structure: entropy, correlations and prediction.

MOTIVATION: Is protein secondary structure primarily determined by local interactions between residues closely spaced along the amino acid backbone or by non-local tertiary interactions? To answer this question, we measure the entropy densities of primary and secondary structure sequences, and the local inter-sequence mutual information density.

RESULTS: We find that the important inter-sequence interactions are short ranged, that correlations between neighboring amino acids are essentially uninformative and that only one-fourth of the total information needed to determine the secondary structure is available from local inter-sequence correlations. These observations support the view that the majority of most proteins fold via a cooperative process where secondary and tertiary structure form concurrently. Moreover, existing single-sequence secondary structure prediction algorithms are almost optimal, and we should not expect a dramatic improvement in prediction accuracy.

AVAILABILITY: Both the data sets and analysis code are freely available from our Web site at http://compbio.berkeley.edu/
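The quantities named in the abstract (sequence entropy and inter-sequence mutual information) can be illustrated with a minimal sketch. This is not the authors' analysis code: it uses the simple plug-in (frequency-count) estimator, which is known to be biased on small samples, and the function names and toy sequences below are hypothetical, chosen only for illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Plug-in Shannon entropy, in bits per symbol, of an iterable of hashable symbols."""
    counts = Counter(symbols)
    n = sum(counts.values())
    return -sum(c / n * log2(c / n) for c in counts.values())


def block_entropy(seq, length):
    """Plug-in entropy of overlapping blocks of `length` consecutive symbols."""
    blocks = [seq[i:i + length] for i in range(len(seq) - length + 1)]
    return entropy(blocks)


def mutual_information(xs, ys):
    """Plug-in I(X;Y) = H(X) + H(Y) - H(X,Y) for two position-aligned sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must be aligned and of equal length")
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


if __name__ == "__main__":
    # Hypothetical toy data: one-letter residues and a three-state
    # secondary structure string (H = helix, E = strand, C = coil).
    residues = "AAGGPPAAGGVVII"
    structure = "HHCCCCHHCCEEEE"
    print("H(residues)        =", entropy(residues))
    print("H(structure)       =", entropy(structure))
    print("H(structure pairs) =", block_entropy(structure, 2))
    print("I(residue; struct) =", mutual_information(residues, structure))
```

An entropy density is the per-symbol growth rate of `block_entropy` as the block length increases; with real data, the small-sample bias of the plug-in estimator would have to be corrected before comparing such values.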
