Machine Learning: How Much Does It Tell about Protein Folding Rates?

Predicting protein folding rates is a necessary step towards understanding the principles of protein folding. As the amount of experimental data has grown, numerous protein folding models and predictors of folding rates have been developed over the last decade. The problem has also attracted scientists from computational fields, leading to the publication of several machine learning-based models for predicting the rate of protein folding. Some of them claim to predict the logarithm of the protein folding rate with an accuracy greater than 90%. However, there are reasons to believe that such claims are exaggerated because of large fluctuations and overfitting of the estimates. When we confronted three selected published models with new data, we found much lower predictive power than reported in the original publications. Overly optimistic estimates of predictive power arise from violations of the basic principles of machine learning. We highlight common misconceptions in the studies claiming excessive predictive power and propose using learning curves as a safeguard against those mistakes. As an example, we show that the current amount of experimental data is insufficient to build a linear predictor of the logarithms of folding rates based on protein amino acid composition.
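The learning-curve check mentioned above can be illustrated with a minimal sketch. It is not the authors' code or data: the compositions and log folding rates below are synthetic, and the function names `composition` and `learning_curve`, the sample size of 100, and the noise level are assumptions chosen only for illustration. The sketch fits an ordinary least-squares model on 20 amino acid composition features and compares training error with held-out error as the training set grows.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def composition(seq, alphabet=ALPHABET):
    """Return the fraction of each amino acid in a sequence (hypothetical helper)."""
    counts = np.array([seq.count(a) for a in alphabet], dtype=float)
    return counts / max(len(seq), 1)


def learning_curve(X, y, train_sizes, n_repeats=50, seed=0):
    """Mean train/test squared error of a least-squares fit for each training size.

    For every size m, the data are randomly split n_repeats times into m
    training points and the remaining held-out points.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    train_err, test_err = [], []
    for m in train_sizes:
        tr, te = [], []
        for _ in range(n_repeats):
            perm = rng.permutation(n)
            tr_idx, te_idx = perm[:m], perm[m:]
            # Design matrices with an intercept column. Composition features
            # sum to 1, so the matrix is rank deficient; lstsq then returns
            # the minimum-norm solution.
            A = np.column_stack([np.ones(m), X[tr_idx]])
            B = np.column_stack([np.ones(len(te_idx)), X[te_idx]])
            w, *_ = np.linalg.lstsq(A, y[tr_idx], rcond=None)
            tr.append(np.mean((A @ w - y[tr_idx]) ** 2))
            te.append(np.mean((B @ w - y[te_idx]) ** 2))
        train_err.append(np.mean(tr))
        test_err.append(np.mean(te))
    return np.array(train_err), np.array(test_err)


# Synthetic stand-in for a folding-rate dataset (assumption, not real data).
rng = np.random.default_rng(42)
n_proteins = 100
X = rng.dirichlet(np.full(20, 5.0), size=n_proteins)  # random compositions
w_true = rng.normal(0.0, 10.0, size=20)               # made-up linear effect
y = X @ w_true + rng.normal(0.0, 1.0, size=n_proteins)  # noisy "log rates"

sizes = [25, 40, 60, 80]
train_err, test_err = learning_curve(X, y, sizes)
for m, a, b in zip(sizes, train_err, test_err):
    print(f"n_train={m:3d}  train MSE={a:.3f}  test MSE={b:.3f}")
```

A wide gap between the two curves at the largest available training size suggests that the model has more free parameters than the data can support, which is the situation the abstract describes for composition-based linear predictors. Reporting only the training error in such a regime would overstate predictive power.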
