Machine learning approach to predict protein phosphorylation sites by incorporating evolutionary information

Background: Most existing in silico phosphorylation site prediction systems use a machine learning approach, which requires a good set of classification data to build the classification knowledge. Because phosphorylation is catalyzed by kinase enzymes, the kinase information of phosphorylated sites has been used as the major classification data in most of these systems. However, far fewer protein sequences carry kinase annotations than have been sequenced to date, so prediction systems built on the small clique of kinase-annotated proteins cannot be considered reliable for predicting outside that clique. Such systems are therefore not generalized. In this paper, a novel generalized prediction system, PPRED (Phosphorylation PREDictor), is proposed that ignores kinase information and uses only the evolutionary information of proteins to classify phosphorylation sites.

Results: Experimental results based on cross-validation and an independent benchmark reveal the significance of using evolutionary information alone to classify phosphorylation sites from protein sequences. The prediction performance of the proposed system is better than that of existing prediction systems that also do not incorporate kinase information. The system is also comparable to systems that do incorporate kinase information in predicting such sites.

Conclusions: The approach presented in this paper provides an efficient way to identify phosphorylation sites in a given protein primary sequence. This is valuable information for molecular biologists working on protein phosphorylation sites and for bioinformaticians developing generalized prediction systems for post-translational modifications such as phosphorylation or glycosylation. PPRED is publicly available at http://www.cse.univdhaka.edu/~ashis/ppred/index.php.
