Structure-based protein folding type classification and folding rate prediction

Folding rate is an important property of a protein, and predicting it helps in understanding the protein folding process and in guiding protein design. In this study, we developed a support vector machine (SVM)-based method that predicts the folding kinetic type (two-state or non-two-state) and the real-valued folding rate from features calculated from the three-dimensional structure, such as contact order, properties of non-local contact clusters, secondary structure information, and sequence length. We systematically studied the contribution of each individual feature to folding rate prediction. Using the features with the highest individual contributions, we trained the SVM with leave-one-out cross-validation and evaluated it on a testing dataset. The Pearson correlation coefficient, mean absolute difference, and root mean square error between the predicted and experimental folding rates (base-10 logarithmic scale) are 0.814, 0.752, and 0.910, respectively, for two-state proteins, and 0.860, 0.687, and 0.876 for non-two-state proteins. Moreover, our method predicts whether a protein of known atomic structure folds according to two-state or non-two-state kinetics, and it correctly classifies the folding mechanism of 80% of the proteins in a testing dataset. Finally, we evaluated our method alongside eight existing protein folding rate prediction tools on a non-overlapping benchmark dataset. The prediction performance will also be reported and discussed.
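As an illustration of the three evaluation measures named in the abstract, the following minimal sketch computes the Pearson correlation coefficient, mean absolute difference, and root mean square error between predicted and experimental log10 folding rates. The function names and the sample values are hypothetical and chosen only for illustration; they are not the authors' code or data.

```python
import math


def pearson_r(pred, obs):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_o = sum(obs) / n
    cov = sum((p - mean_p) * (o - mean_o) for p, o in zip(pred, obs))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_p * var_o)


def mean_absolute_difference(pred, obs):
    """Average of |predicted - observed|."""
    return sum(abs(p - o) for p, o in zip(pred, obs)) / len(pred)


def root_mean_square_error(pred, obs):
    """Square root of the mean squared difference."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(pred))


# Hypothetical log10 folding rates, for illustration only (not data from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 1.5, 3.5, 3.5]

r = pearson_r(predicted, observed)
mad = mean_absolute_difference(predicted, observed)
rmse = root_mean_square_error(predicted, observed)
print(f"r={r:.3f}  MAD={mad:.3f}  RMSE={rmse:.3f}")
```

Because the rates are compared on a base-10 logarithmic scale, a mean absolute difference of 1.0 would correspond to predictions that are off by a factor of ten on average.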
