PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine

Phosphorylation is one of the most essential post-translational modifications in eukaryotes. Studies on kinases and their substrates are important for understanding cellular signaling networks. Because large-scale wet-bench experiments are costly in time and labor, computational prediction of phosphorylation sites has become important, and many computational tools have been developed in recent decades. These prediction tools fall into two categories: kinase-specific and non-kinase-specific. With more kinases being discovered by new sequencing technologies, accurate non-kinase-specific prediction tools are highly desirable for whole-genome annotation in a wider variety of species. In this manuscript, a support vector machine is used to combine eight sequence-level scoring functions to predict phosphorylation sites. The attributes used in this work are Shannon entropy, relative entropy, predicted protein secondary structure, predicted protein disorder, solvent accessible area, overlapping properties, averaged cumulative hydrophobicity, and k-nearest neighbor; together they yielded better results than the attributes used previously by similar methods. In tenfold cross-validation on animal data, the method achieved AUC values of 0.8405/0.8183/0.7383 for serine (S), threonine (T), and tyrosine (Y) phosphorylation sites, respectively. The model trained on the animal phosphorylation sites was also applied to a plant phosphorylation site dataset as an independent test. The AUC values for the independent test dataset were 0.7761/0.6652/0.5958 for S/T/Y phosphorylation sites, which compared favorably with those of several existing methods. A web server based on our method was constructed for public use. The server, trained model, and all datasets used in the current study are available at http://sysbio.unl.edu/PhosphoSVM.
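Two of the eight attributes, Shannon entropy and relative entropy, are conservation scores that can be illustrated on a single alignment column. The following is a minimal sketch, not the PhosphoSVM implementation: it assumes the scores are computed from amino-acid frequencies in a column of aligned residues, and the function names, the pseudocount parameter, and the uniform background distribution are hypothetical choices made for illustration only.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(column, pseudocount=1.0):
    """Return smoothed amino-acid frequencies for one alignment column.

    `column` is a string of residues, one per aligned sequence.
    Gaps and non-standard symbols are ignored (an assumption of this sketch).
    """
    counts = Counter(res for res in column if res in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    if total == 0:
        raise ValueError("column has no standard residues and no pseudocount")
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}


def shannon_entropy(freqs):
    """Shannon entropy in bits; low values indicate a conserved column."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def relative_entropy(freqs, background):
    """Kullback-Leibler divergence (bits) of the column from a background."""
    return sum(
        p * math.log2(p / background[aa]) for aa, p in freqs.items() if p > 0
    )


# Hypothetical background: every amino acid equally likely.
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# Toy columns: one mostly serine, one with ten different residues.
conserved = column_frequencies("SSSSSSSSST", pseudocount=0.0)
variable = column_frequencies("ACDEFGHIKL", pseudocount=0.0)

print("conserved:", shannon_entropy(conserved), relative_entropy(conserved, uniform))
print("variable: ", shannon_entropy(variable), relative_entropy(variable, uniform))
```

In this sketch the conserved column has the lower Shannon entropy and the higher relative entropy against the uniform background. In a real pipeline the background would more plausibly come from observed amino-acid frequencies, and the per-position scores would be computed over a window around each candidate S/T/Y residue before being passed to the classifier.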
