Identification of protein functions using a machine-learning approach based on sequence-derived properties

Background: Predicting the function of an unknown protein is an essential goal in bioinformatics. Sequence similarity-based approaches are widely used for function prediction; however, they are often inadequate in the absence of similar sequences or when the sequence similarity among known protein sequences is statistically weak. This study aimed to develop an accurate prediction method for identifying protein function, irrespective of sequence and structural similarities.

Results: A highly accurate prediction method capable of identifying protein function, based solely on protein sequence properties, is described. This method analyses and identifies specific features of the protein sequence that are highly correlated with certain protein functions and determines the combination of protein sequence features that best characterises protein function. Thirty-three features that represent subtle differences in local regions and full regions of the protein sequences were introduced. On the basis of 484 features extracted solely from the protein sequence, models were built to predict the functions of 11 different proteins from a broad range of cellular components, molecular functions, and biological processes. The accuracy of protein function prediction using random forests with feature selection ranged from 94.23% to 100%. The local sequence information was found to have a broad range of applicability in predicting protein function.

Conclusion: We present an accurate prediction method using a machine-learning approach based solely on protein sequence properties. The primary contribution of this paper is to propose new PNPRD features representing global and/or local differences in sequences, based on positively and/or negatively charged residues, to assist in predicting protein function. In addition, we identified a compact and useful feature subset for predicting the function of various proteins. Our results indicate that sequence-based classifiers can provide good results among a broad range of proteins, that the proposed features are useful in predicting several functions, and that the combination of our and traditional features may support the creation of a discriminative feature set for specific protein functions.
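
To make the idea of charge-based features over full and local sequence regions concrete, the following is a minimal Python sketch. It is not the PNPRD definition, which the abstract does not spell out. The residue groupings (K and R as positive, D and E as negative), the choice of N-terminal and C-terminal windows as the "local" regions, the window length, and the names `charge_fractions` and `charge_features` are all assumptions made for illustration.

```python
from typing import Dict

# Assumed residue groupings; histidine is left out of the positive set here.
POSITIVE = set("KR")  # lysine, arginine
NEGATIVE = set("DE")  # aspartate, glutamate


def charge_fractions(region: str) -> Dict[str, float]:
    """Return the fractions of positive and negative residues in a region,
    plus their difference. An empty region yields all zeros."""
    n = len(region)
    if n == 0:
        return {"pos": 0.0, "neg": 0.0, "diff": 0.0}
    pos = sum(1 for aa in region if aa in POSITIVE) / n
    neg = sum(1 for aa in region if aa in NEGATIVE) / n
    return {"pos": pos, "neg": neg, "diff": pos - neg}


def charge_features(seq: str, window: int = 10) -> Dict[str, float]:
    """Compute charge fractions for the full sequence and for two
    hypothetical local regions: the first and last `window` residues."""
    if window < 1:
        raise ValueError("window must be at least 1")
    seq = seq.upper()
    regions = {
        "full": seq,
        "nterm": seq[:window],   # local region at the N-terminus
        "cterm": seq[-window:],  # local region at the C-terminus
    }
    features: Dict[str, float] = {}
    for name, region in regions.items():
        for key, value in charge_fractions(region).items():
            features[f"{name}_{key}"] = value
    return features


if __name__ == "__main__":
    # Toy sequence, not taken from the paper.
    for name, value in charge_features("KKDDAAAAAA", window=2).items():
        print(name, value)
```

A feature dictionary of this kind, computed per protein, could be stacked into a matrix and passed to a random forest classifier with a feature selection step, which is the type of model the abstract reports using.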
