ProFold: Protein Fold Classification with Additional Structural Features and a Novel Ensemble Classifier

Protein fold classification plays an important role in both protein functional analysis and drug design. The number of proteins in the PDB is very large, but only a small fraction has been categorized and stored in the SCOPe database. It is therefore necessary to develop an efficient method for protein fold classification. In recent years, a variety of classification methods have been applied to this problem. In this study, we propose a novel classification method called proFold. We incorporate protein tertiary structure information at the feature extraction stage and employ a novel ensemble strategy at the classifier training stage. On the widely used DD-dataset, proFold achieves an overall accuracy of 76.2% in a comparison with existing, similar ensemble classifiers evaluated on the same dataset. On two other commonly used datasets, the EDD-dataset and the TG-dataset, it achieves accuracies of 93.2% and 94.3%, respectively, higher than those of existing methods. ProFold is available to the public as a web server.
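To make the terms "ensemble classifier" and "overall accuracy" concrete, the following is a minimal sketch in Python. It is not the ensemble strategy used by proFold, which this abstract does not describe; it shows plain majority voting over base classifiers, and all names (`majority_vote`, `overall_accuracy`, the toy classifiers, the labels `fold_1` and `fold_2`, and the sample values) are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical type: a base classifier maps a feature vector to a fold label.
Classifier = Callable[[Sequence[float]], Hashable]


def majority_vote(classifiers: Sequence[Classifier],
                  features: Sequence[float]) -> Hashable:
    """Return the label predicted by the largest number of base classifiers."""
    if not classifiers:
        raise ValueError("at least one base classifier is required")
    votes = [clf(features) for clf in classifiers]
    # most_common(1) yields the (label, count) pair with the highest count.
    return Counter(votes).most_common(1)[0][0]


def overall_accuracy(predicted: Sequence[Hashable],
                     actual: Sequence[Hashable]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Toy base classifiers (illustrative threshold rules, not real fold predictors).
def clf_a(x: Sequence[float]) -> str:
    return "fold_1" if x[0] > 0.5 else "fold_2"


def clf_b(x: Sequence[float]) -> str:
    return "fold_1" if x[1] > 0.5 else "fold_2"


def clf_c(x: Sequence[float]) -> str:
    return "fold_1" if sum(x) > 1.0 else "fold_2"


if __name__ == "__main__":
    # Made-up two-dimensional feature vectors with made-up true labels.
    samples = [(0.9, 0.8), (0.2, 0.1), (0.9, 0.0), (0.1, 0.95)]
    labels = ["fold_1", "fold_2", "fold_2", "fold_2"]
    ensemble = [clf_a, clf_b, clf_c]
    predictions = [majority_vote(ensemble, s) for s in samples]
    print(predictions)
    print(overall_accuracy(predictions, labels))
```

Majority voting is only the simplest way to fuse base classifiers; weighted or probability-based fusion schemes follow the same pattern with a different rule for combining the votes.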
