K-Nearest Neighbor Classifier Ensemble for Prediction of Phosphorylation Sites

Abstract: Recently, researchers have paid increasing attention to the prediction of phosphorylation sites, because phosphorylation plays an important role in many biological processes, such as metabolism, growth, and membrane transport. Although many approaches exist for predicting phosphorylation sites, few of them consider the ensemble approach. In this paper, we first propose a new classifier ensemble framework called K-Nearest Neighbor Classifier Ensemble (KNNCE), which incorporates the bagging technique and the K-nearest neighbor classifier into the ensemble framework for prediction of phosphorylation sites. We then apply KNNCE to six kinase families: CK1, GRK, GSK, INSR, PKB, and SRC. The experiments illustrate that (1) KNNCE achieves good results in these families, and (2) the accuracies of the prediction system for these families are 69.25%, 69%, 71.91%, 86.65%, 88.83% and 95.22%, respectively.

I. INTRODUCTION

Protein phosphorylation, one of the most important post-translational modifications in both prokaryotic and eukaryotic cells, is involved in the regulation of many cellular pathways [1][2], including metabolism, growth, differentiation and membrane transport. Kinases, also known as phosphotransferases, constitute a large protein superfamily and act as the enzymes in protein phosphorylation. In eukaryotic organisms, the most common form of phosphorylation is the introduction of a phosphate group onto a particular serine, threonine or tyrosine residue (the phosphorylation site) of the substrate, catalyzed by a specific kinase.

Phosphorylation sites and the relevant kinases can be identified in vivo and in vitro. Such methods include mass spectrometry (MS) techniques (Aebersold et al., 2003 [3]), peptide microarrays (Rychlewski et al., 2004 [4]), and phosphospecific proteolysis (Knight et al., 2003 [5]). Phospho.ELM (Diella et al., 2004 [6]) is a database of such experimentally verified phosphorylation sites in eukaryotic proteins. However, such methods are usually expensive and time-consuming.
With the rapidly growing number of published protein sequences, computational approaches that predict phosphorylation sites more conveniently and efficiently are in high demand and have developed quickly.

NetPhos (Blom et al., 1999 [7]) is an early prediction system based on a standard feed-forward artificial neural network; it was extended to NetPhosK (Blom et al., 2004 [8]), a kinase-specific prediction system. Scansite (latest version 2.0) is a search tool for motifs that are likely to be phosphorylated by specific kinases (Yaffe et al., 2001 [9]). It is based on a matrix of the selectivity values of residues at each position relative to the experimentally identified phosphorylation sites. Kim et al., 2004 [10] designed PredPhospho, also a kinase-specific prediction system, which adopts the support vector machine (SVM) as its core algorithm. Xue et al., 2005 [11] proposed the group-based phosphorylation predicting and scoring (GPS) method, which calculates the similarity of motifs based on the BLOSUM62 matrix. Xue et al., 2006 [12] also applied Bayesian decision theory in an approach called PPSP (Prediction of PK-specific Phosphorylation site). The hidden Markov model was adopted by Huang et al., 2005 [13] in their web server KinasePhos 1.0 for computationally identifying catalytic kinase-specific phosphorylation sites.
Wong et al., 2007 [14] extended KinasePhos 1.0 to KinasePhos 2.0, which integrates SVM with the protein sequence profile and the protein coupling pattern.

Although a number of approaches exist for prediction of phosphorylation sites, none of them considers the classifier ensemble approach, which combines multiple classifiers to obtain more robust, stable and accurate results. In this paper, we propose a new classifier ensemble approach called K-Nearest Neighbor Classifier Ensemble (KNNCE), which incorporates the bagging technique and the K-nearest neighbor classifier into the ensemble framework to predict eukaryotic protein phosphorylation sites and to improve the accuracy, stability and robustness of the final predicted result.

II. K-N
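The general idea of combining bagging with a K-nearest neighbor classifier, as named in the introduction, can be illustrated with a minimal sketch. This is not the KNNCE implementation: the text above does not specify the feature encoding, the distance measure, the value of K, or the number of bootstrap replicates, so the sketch assumes fixed-length peptide windows compared by Hamming distance, K = 3, 15 replicates, and a simple majority vote. The function names and the toy windows are hypothetical.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length peptide windows differ.

    Assumption: Hamming distance stands in for whatever similarity
    measure the actual system uses.
    """
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Majority label among the k training windows closest to `query`.

    `train` is a list of (window, label) pairs.
    """
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def bagged_knn_predict(train, query, k=3, n_estimators=15, seed=0):
    """Bagging: train each base KNN on a bootstrap replicate, then vote."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_estimators):
        # Bootstrap replicate: draw len(train) items with replacement.
        sample = [rng.choice(train) for _ in train]
        votes[knn_predict(sample, query, k)] += 1
    # Final label is the majority vote over all base classifiers.
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Hypothetical 9-residue windows centered on a candidate serine;
    # label 1 = phosphorylation site, 0 = non-site. Not real data.
    toy_train = [("AAAASAAAA", 1)] * 5 + [("GGGGSGGGG", 0)] * 5
    print(bagged_knn_predict(toy_train, "AAAASAAAG"))
```

Using an odd K and an odd number of replicates with two classes avoids tied votes at both levels of the sketch; a real system would also need a principled encoding of the residues around each candidate site.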

[1]  R. Aebersold, et al. Mass spectrometry-based proteomics, 2003, Nature.

[2]  N. Blom, et al. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence, 2004, Proteomics.

[3]  Bermseok Oh, et al. Prediction of phosphorylation sites using SVMs, 2004, Bioinform.

[4]  Leszek Rychlewski, et al. Target specificity analysis of the Abl kinase using peptide microarray data, 2004, Journal of Molecular Biology.

[5]  Nikolaj Blom, et al. Phospho.ELM: A database of experimentally verified phosphorylation sites in eukaryotic proteins, 2004, BMC Bioinformatics.

[6]  M. Yaffe, et al. A motif-based profile scanning approach for genome-wide prediction of signaling pathways, 2001, Nature Biotechnology.

[7]  Yu Xue, et al. GPS: a comprehensive www server for phosphorylation sites prediction, 2005, Nucleic Acids Res.

[8]  Birgit Schilling, et al. Phosphospecific proteolysis for mapping sites of protein phosphorylation, 2003, Nature Biotechnology.

[9]  Leo Breiman, et al. Bagging Predictors, 1996, Machine Learning.

[10]  Yu Xue, et al. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory, 2006, BMC Bioinformatics.

[11]  Eugene I Shakhnovich, et al. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases, 2003, Proceedings of the National Academy of Sciences of the United States of America.

[12]  Jorng-Tzong Horng, et al. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites, 2005, Nucleic Acids Res.

[13]  N. Blom, et al. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites, 1999, Journal of Molecular Biology.

[14]  Václav Hlavác, et al. Ten Lectures on Statistical and Structural Pattern Recognition, 2002, Computational Imaging and Vision.

[15]  Hsien-Da Huang, et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns, 2007, Nucleic Acids Res.