An ensemble approach to protein fold classification by integration of template‐based assignment and support vector machine classifier

Motivation: Protein fold classification is a critical step in protein structure prediction. There are two possible ways to classify protein folds. One is template‐based fold assignment and the other is ab‐initio prediction using machine learning algorithms. Combining the two to improve prediction accuracy has not been explored before.

Results: We developed two algorithms, HH‐fold and SVM‐fold, for protein fold classification. HH‐fold is a template‐based fold assignment algorithm using the HHsearch program. SVM‐fold is a support vector machine‐based ab‐initio classification algorithm, in which a comprehensive set of features is extracted from three complementary sequence profiles. These two algorithms are then combined, resulting in the ensemble approach TA‐fold. We performed a comprehensive assessment of the proposed methods by comparing them with ab‐initio methods and template‐based threading methods on six benchmark datasets. TA‐fold achieved an accuracy of 0.799 on the DD dataset, which consists of proteins from 27 folds. This represents an improvement of 5.4–11.7% over ab‐initio methods. After this dataset was updated to include more proteins in the same folds, the accuracy increased to 0.971. In addition, TA‐fold achieved >0.9 accuracy on a large dataset consisting of 6451 proteins from 184 folds. Experiments on the LE dataset show that TA‐fold consistently outperforms other threading methods at the family, superfamily and fold levels. The success of TA‐fold is attributed to the combination of template‐based fold assignment and ab‐initio classification using features from complementary sequence profiles that contain rich evolutionary information.

Availability and Implementation: http://yanglab.nankai.edu.cn/TA‐fold/

Contact: yangjy@nankai.edu.cn or mhb‐506@163.com

Supplementary information: Supplementary data are available at Bioinformatics online.
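The abstract does not state how the template‐based and ab‐initio predictions are merged in TA‐fold. As a minimal sketch of one plausible ensemble rule, the Python below trusts the best template hit when its confidence clears a cutoff and otherwise falls back to the ab‐initio classifier. The names `TemplateHit`, `ensemble_fold` and `min_confidence`, and the cutoff value of 0.5, are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class TemplateHit:
    """One template match (hypothetical structure, for illustration only)."""
    fold: str          # fold label of the matched template
    confidence: float  # assumed confidence score in [0, 1]


def ensemble_fold(
    hits: Sequence[TemplateHit],
    ab_initio_predict: Callable[[], str],
    min_confidence: float = 0.5,  # illustrative cutoff, not from the paper
) -> str:
    """Return the fold of the best template hit if it is confident enough;
    otherwise defer to the ab-initio classifier."""
    # Pick the most confident hit; None when no template was found.
    best = max(hits, key=lambda h: h.confidence, default=None)
    if best is not None and best.confidence >= min_confidence:
        return best.fold
    # Weak or missing template evidence: use the ab-initio prediction.
    return ab_initio_predict()
```

Here `ab_initio_predict` stands in for a trained classifier such as an SVM; passing it as a callable means it is only evaluated when the template evidence is weak or absent.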
