A Segmentation-Based Method to Extract Structural and Evolutionary Features for Protein Fold Recognition

Protein fold recognition (PFR) is considered an important step towards solving the protein structure prediction problem. Despite the efforts made so far, finding an accurate and fast computational approach to PFR remains a challenging problem in bioinformatics and computational biology. In this study, we propose a segmentation-based feature extraction technique that captures local evolutionary information embedded in the position-specific scoring matrix (PSSM) and structural information embedded in the secondary structure predicted by SPINE-X. We also employ occurrence features to extract global discriminatory information from the PSSM and the SPINE-X output. By applying a support vector machine (SVM) to the extracted features, we improve protein fold prediction accuracy by 7.4 percent over the best results reported in the literature. We also report 73.8 percent prediction accuracy for a data set consisting of proteins with less than 25 percent sequence similarity, and 80.7 percent prediction accuracy for a data set of proteins belonging to 110 folds with less than 40 percent sequence similarity. Finally, we investigate the relation between the number of folds and the number of features used, and show that the number of features should be increased to obtain better fold prediction results when the number of folds is relatively large.
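To make the distinction between global occurrence features and local segment-level features concrete, the following is a minimal Python sketch. It is not the method of the paper, whose exact segmentation rule is not given in the abstract. It assumes the PSSM is available as a list of rows (one per residue, one column per amino acid), takes the occurrence feature to be a plain column sum, and uses equal-length segments with per-segment column means as a simplified stand-in for the segmentation step. The function names and the `n_segments` parameter are hypothetical.

```python
from typing import List

# Assumed layout: one row per residue, one column per amino acid type.
Matrix = List[List[float]]


def occurrence_features(pssm: Matrix) -> List[float]:
    """Global feature: sum of each PSSM column over the whole sequence."""
    n_cols = len(pssm[0])
    return [sum(row[j] for row in pssm) for j in range(n_cols)]


def segmented_features(pssm: Matrix, n_segments: int) -> List[float]:
    """Local features: mean of each column inside equal-length segments.

    Equal-length splitting is an illustrative assumption, not the
    segmentation rule used by the authors.
    """
    length = len(pssm)
    n_cols = len(pssm[0])
    features: List[float] = []
    for s in range(n_segments):
        start = s * length // n_segments
        end = (s + 1) * length // n_segments
        segment = pssm[start:end]
        for j in range(n_cols):
            column = [row[j] for row in segment]
            # An empty segment (sequence shorter than n_segments) yields 0.0.
            features.append(sum(column) / len(column) if column else 0.0)
    return features


def extract_features(pssm: Matrix, n_segments: int = 4) -> List[float]:
    """Concatenate global and local features into one vector."""
    return occurrence_features(pssm) + segmented_features(pssm, n_segments)


if __name__ == "__main__":
    # Toy 4-residue, 2-column matrix standing in for a real L x 20 PSSM.
    toy = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
    print(extract_features(toy, n_segments=2))
```

Under these assumptions, the resulting vector has `n_cols * (1 + n_segments)` entries and could be passed to any SVM implementation. The same two functions could be applied to a matrix of predicted secondary structure probabilities in place of the PSSM.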
