An Improved Protein Structural Classes Prediction Method by Incorporating Both Sequence and Structure Information.

Information on protein structural classes benefits secondary and tertiary structure prediction, protein fold prediction, and protein function analysis, so predicting structural classes accurately is important. In recent years, several computational methods have been developed to predict the structural classes of proteins with low sequence similarity (25%-40%). However, the reported prediction accuracies remain unsatisfactory. To improve them, we propose three different feature extraction methods and construct a comprehensive feature set that captures both sequence and structure information. Applying a random forest (RF) classifier to this feature set yields a novel method for structural class prediction. We test the proposed method on three low-sequence-similarity benchmark datasets (25PDB, 640, and 1189) and obtain overall prediction accuracies of 93.5%, 92.6%, and 93.4%, respectively. Compared with six competing methods, these accuracies are 3.4%, 6.2%, and 8.7% higher, respectively, than those of the best-performing competing methods, showing the superiority of our method. Moreover, because the three benchmark datasets are small, we further test the proposed method on three updated large-scale datasets with different sequence similarities (40%, 30%, and 25%). The proposed method achieves accuracies above 90% on all three datasets, consistent with its accuracies on the benchmark datasets. These experimental results suggest that our method is an effective and promising tool for structural class prediction. A web server implementing the proposed method is currently available at http://121.192.180.204:8080/RF_PSCP/Index.html.
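
As a rough illustration of what a feature set combining sequence and structure information can look like, the following minimal sketch builds one feature vector per protein from its amino acid sequence and a secondary structure string. It is not the three feature extraction methods proposed here, which the abstract does not specify. The function names, the three-state H/E/C alphabet, and the choice of composition and longest-segment features are all assumptions made for illustration.

```python
from collections import Counter

# Assumed alphabets: the 20 standard amino acids and a three-state
# secondary structure encoding (H = helix, E = strand, C = coil).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
SS_STATES = "HEC"


def composition(text, alphabet):
    """Fraction of each alphabet symbol in text; other symbols are ignored."""
    counts = Counter(ch for ch in text.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(alphabet)
    return [counts[ch] / total for ch in alphabet]


def longest_run(ss, state):
    """Length of the longest uninterrupted run of state, divided by len(ss)."""
    best = run = 0
    for ch in ss:
        run = run + 1 if ch == state else 0
        best = max(best, run)
    return best / len(ss) if ss else 0.0


def feature_vector(seq, ss):
    """Hypothetical 26-dimensional vector: 20 sequence + 6 structure features."""
    if len(seq) != len(ss):
        raise ValueError("sequence and secondary structure lengths differ")
    ss = ss.upper()
    features = composition(seq, AMINO_ACIDS)                   # sequence information
    features += composition(ss, SS_STATES)                     # structure content
    features += [longest_run(ss, state) for state in SS_STATES]  # segment lengths
    return features


# Toy input, not taken from any benchmark dataset.
example = feature_vector("MKVLAAGIVG", "CHHHHHEEEC")
print(len(example))
```

Vectors of this kind, one per protein with a structural class label, could then be passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`. That pairing is likewise an assumption, not a description of the web server's implementation.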