An Empirical Study of Different Approaches for Protein Classification

Many domains would benefit from reliable and efficient systems for automatic protein classification. An area of particular interest in recent studies on automatic protein classification is the exploration of new methods for extracting features from a protein that work well for specific problems. These methods, however, do not generalize and have proven useful in only a few domains. Our goal is to evaluate several feature extraction approaches for representing proteins by testing them across multiple datasets. Different types of protein representations are evaluated: those starting from the position-specific scoring matrix (PSSM) of the protein, those derived from the amino acid sequence, two matrix representations, and features taken from the 3D tertiary structure of the protein. We also test new variants of protein descriptors. We develop our system experimentally by comparing and combining different descriptors taken from the protein representations. Each descriptor is used to train a separate support vector machine (SVM), and the SVM outputs are combined by the sum rule. Some stand-alone descriptors work well on some datasets but not on others. Fusing the different descriptors yields performance that holds up across all tested datasets, in some cases exceeding the state of the art.
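
The sum-rule fusion step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`zscore`, `sum_rule`, `predict`), the score values, and the z-score normalization applied before summing are all assumptions made for illustration. Each input stands for the per-class scores that one descriptor-specific SVM assigns to each test protein.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Rescale one classifier's scores (rows = samples, columns = classes)
    to mean 0 and standard deviation 1, so that no classifier dominates
    the sum only because of its score range. (Assumed preprocessing.)"""
    flat = [s for row in scores for s in row]
    mu, sigma = mean(flat), pstdev(flat)
    if sigma == 0:
        return [[0.0 for _ in row] for row in scores]
    return [[(s - mu) / sigma for s in row] for row in scores]


def sum_rule(score_sets, normalize=True):
    """Fuse several classifiers by adding their scores per sample and class."""
    if normalize:
        score_sets = [zscore(s) for s in score_sets]
    n_samples = len(score_sets[0])
    n_classes = len(score_sets[0][0])
    return [
        [sum(s[i][c] for s in score_sets) for c in range(n_classes)]
        for i in range(n_samples)
    ]


def predict(fused):
    """Return, for each sample, the index of the class with the highest fused score."""
    return [max(range(len(row)), key=row.__getitem__) for row in fused]


# Hypothetical scores from two SVMs, each trained on a different descriptor.
pssm_scores = [[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]]
seq_scores = [[0.6, 0.4], [0.7, 0.3], [0.3, 0.7]]

labels = predict(sum_rule([pssm_scores, seq_scores]))
print(labels)
```

In this toy input the two classifiers disagree on the second protein, and the fused score follows the classifier that is more confident relative to the spread of its own scores.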
