A Combination of Feature Extraction Methods with an Ensemble of Different Classifiers for Protein Structural Class Prediction Problem

Better understanding of structural class of a given protein reveals important information about its overall folding type and its domain. It can also be directly used to provide critical information on general tertiary structure of a protein which has a profound impact on protein function determination and drug design. Despite tremendous enhancements made by pattern recognition-based approaches to solve this problem, it still remains as an unsolved issue for bioinformatics that demands more attention and exploration. In this study, we propose a novel feature extraction model that incorporates physicochemical and evolutionary-based information simultaneously. We also propose overlapped segmented distribution and autocorrelation-based feature extraction methods to provide more local and global discriminatory information. The proposed feature extraction methods are explored for 15 most promising attributes that are selected from a wide range of physicochemical-based attributes. Finally, by applying an ensemble of different classifiers namely, Adaboost.M1, LogitBoost, naive Bayes, multilayer perceptron (MLP), and support vector machine (SVM) we show enhancement of the protein structural class prediction accuracy for four popular benchmarks.

[1]  Somnuk Phon-Amnuaisuk,et al.  Protein Fold Prediction Problem Using Ensemble of Classifiers , 2009, ICONIP.

[2]  Parviz Abdolmaleki,et al.  Novel hybrid method for the evaluation of parameters contributing in determination of protein structural classes. , 2007, Journal of theoretical biology.

[3]  Angelo M Facchiano,et al.  Prediction of the protein structural class by specific peptide frequencies. , 2009, Biochimie.

[4]  Xiaoqi Zheng,et al.  Accurate prediction of protein structural class using auto covariance transformation of PSI-BLAST profiles , 2011, Amino Acids.

[5]  Yoav Freund,et al.  Experiments with a New Boosting Algorithm , 1996, ICML.

[6]  M. Michael Gromiha,et al.  Multiple Contact Network Is a Key Determinant to Protein Folding Rates , 2009, J. Chem. Inf. Model..

[7]  Shuigeng Zhou,et al.  A new taxonomy-based protein fold recognition approach based on autocross-covariance transformation , 2009, Bioinform..

[8]  Hiroyuki Ogata,et al.  AAindex: Amino Acid Index Database , 1999, Nucleic Acids Res..

[9]  Y Cai,et al.  Prediction of protein structural classes by neural network. , 2000, Biochimie.

[10]  Lukasz A. Kurgan,et al.  Prediction of protein structural class using novel evolutionary collocation‐based sequence representation , 2008, J. Comput. Chem..

[11]  Minoru Kanehisa,et al.  AAindex: amino acid index database, progress report 2008 , 2007, Nucleic Acids Res..

[12]  Lukasz A. Kurgan,et al.  Modular prediction of protein structural classes from sequences of twilight-zone identity with predicting sequences , 2009, BMC Bioinformatics.

[13]  อนิรุธ สืบสิงห์,et al.  Data Mining Practical Machine Learning Tools and Techniques , 2014 .

[14]  Zheng Yuan,et al.  How good is prediction of protein structural class by the component‐coupled method? , 2000, Proteins.

[15]  Xin Chen,et al.  Prediction of protein structural classes for low-homology sequences based on predicted secondary structure , 2010, BMC Bioinformatics.

[16]  Guo-Ping Zhou,et al.  An Intriguing Controversy over Protein Structural Class Prediction , 1998, Journal of protein chemistry.

[17]  James G. Lyons,et al.  A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. , 2013, Journal of theoretical biology.

[18]  Lukasz Kurgan,et al.  Prediction of protein structural class for the twilight zone sequences. , 2007, Biochemical and biophysical research communications.

[19]  Abdollah Dehzangi,et al.  Fold prediction problem: the application of new physical and physicochemical-based features. , 2011, Protein and peptide letters.

[20]  Z.-C. Li,et al.  Prediction of protein structure class by coupling improved genetic algorithm and support vector machine , 2008, Amino Acids.

[21]  Kuo-Chen Chou,et al.  Prediction of Protein Structural Classes by Support Vector Machines , 2002, Comput. Chem..

[22]  Kuo-Chen Chou,et al.  Boosting classifier for predicting protein domain structural class. , 2005, Biochemical and biophysical research communications.

[23]  Vladimir N. Vapnik,et al.  The Nature of Statistical Learning Theory , 2000, Statistics for Engineering and Information Science.

[24]  Lukasz A. Kurgan,et al.  Prediction of structural classes for protein sequences and domains - Impact of prediction algorithms, sequence representation and homology, and test procedures on accuracy , 2006, Pattern Recognit..

[25]  Kuo-Chen Chou,et al.  Predicting protein structural class with AdaBoost Learner. , 2006, Protein and peptide letters.

[26]  Geoffrey I. Webb,et al.  MultiBoosting: A Technique for Combining Boosting and Wagging , 2000, Machine Learning.

[27]  Jun Wang,et al.  An information‐theoretic approach to the prediction of protein structural class , 2010, J. Comput. Chem..

[28]  C. Chothia,et al.  Structural patterns in globular proteins , 1976, Nature.

[29]  J. Friedman Special Invited Paper-Additive logistic regression: A statistical view of boosting , 2000 .

[30]  N.R. Pal,et al.  Prediction of Protein Folds: Extraction of New Features, Dimensionality Reduction, and Fusion of Heterogeneous Classifiers , 2009, IEEE Transactions on NanoBioscience.

[31]  Yu-Dong Cai,et al.  Support Vector Machines for predicting protein structural class , 2001, BMC Bioinformatics.

[32]  Peixiang Cai,et al.  Predicting protein structural class with pseudo-amino acid composition and support vector machine fusion network. , 2006, Analytical biochemistry.

[33]  Babak Nadjar Araabi,et al.  A protein fold classifier formed by fusing different modes of pseudo amino acid composition via PSSM , 2011, Comput. Biol. Chem..

[34]  Ganesan Pugalenthi,et al.  Predicting protein structural class by SVM with class-wise optimized features and decision probabilities. , 2008, Journal of theoretical biology.

[35]  M. Michael Gromiha,et al.  A Statistical Model for Predicting Protein Folding Rates from Amino Acid Sequence with Structural Class Information , 2005, J. Chem. Inf. Model..

[36]  Kuo-Chen Chou,et al.  Prediction protein structural classes with pseudo-amino acid composition: approximate entropy and hydrophobicity pattern. , 2008, Journal of theoretical biology.

[37]  K. Chou,et al.  Using LogitBoost classifier to predict protein structural classes. , 2006, Journal of theoretical biology.

[38]  Jian-Ding Qiu,et al.  Using support vector machines for prediction of protein structural classes based on discrete wavelet transform , 2009, J. Comput. Chem..

[39]  B. Efron,et al.  A Leisurely Look at the Bootstrap, the Jackknife, and , 1983 .

[40]  Deepak Kolippakkam,et al.  APDbase: Amino acid Physicochemical properties Database , 2005, Bioinformation.

[41]  Zong Dai,et al.  Prediction of protein structural classes by Chou’s pseudo amino acid composition: approached using continuous wavelet transform and principal component analysis , 2008, Amino Acids.

[42]  Scott Dick,et al.  Classifier ensembles for protein structural class prediction with varying homology. , 2006, Biochemical and biophysical research communications.

[43]  A G Murzin,et al.  SCOP: a structural classification of proteins database for the investigation of sequences and structures. , 1995, Journal of molecular biology.

[44]  Somnuk Phon-Amnuaisuk,et al.  Enhancing Protein Fold Prediction Accuracy Using an Ensemble of Different Classifiers , 2009, Aust. J. Intell. Inf. Process. Syst..

[45]  Lukasz A. Kurgan,et al.  Prediction of Secondary Protein Structure Content from Primary Sequence Alone - A Feature Selection Based Approach , 2005, MLDM.

[46]  Lukasz A. Kurgan,et al.  Secondary structure-based assignment of the protein structural classes , 2008, Amino Acids.

[47]  Pooja Jain,et al.  Automatic structure classification of small proteins using random forest , 2010, BMC Bioinformatics.

[48]  Abdollah Dehzangi,et al.  Solving protein fold prediction problem using fusion of heterogeneous classifiers , 2011 .

[49]  Zu-Guo Yu,et al.  Prediction of protein structural classes by recurrence quantification analysis based on chaos game representation. , 2009 .

[50]  Lukasz A. Kurgan,et al.  SCPRED: Accurate prediction of protein structural class for sequences of twilight-zone similarity with predicting sequences , 2008, BMC Bioinformatics.

[51]  Ian H. Witten,et al.  Data mining - practical machine learning tools and techniques, Second Edition , 2005, The Morgan Kaufmann series in data management systems.

[52]  Ian Witten,et al.  Data Mining , 2000 .

[53]  Parviz Abdolmaleki,et al.  Novel two-stage hybrid neural discriminant model for predicting proteins structural classes. , 2007, Biophysical chemistry.

[54]  Thomas G. Dietterich Multiple Classifier Systems , 2000, Lecture Notes in Computer Science.

[55]  Kuo-Chen Chou,et al.  Prediction of protein structure classes with pseudo amino acid composition and fuzzy support vector machine network. , 2007, Protein and peptide letters.

[56]  Cangzhi Jia,et al.  A high-accuracy protein structural class prediction algorithm using predicted secondary structural information. , 2010, Journal of theoretical biology.

[57]  Chris H. Q. Ding,et al.  Multi-class protein fold recognition using support vector machines and neural networks , 2001, Bioinform..