Abstract: Prediction of phosphorylation sites has recently attracted increasing attention because phosphorylation plays an important role in many biological processes, such as metabolism, growth, and membrane transport. Although many approaches to predicting phosphorylation sites exist, few of them consider an ensemble approach. In this paper, we first propose a new classifier ensemble framework called K-Nearest Neighbor Classifier Ensemble (KNNCE), which incorporates the bagging technique and the K-nearest neighbor classifier into an ensemble framework for the prediction of phosphorylation sites. Then, we apply KNNCE to six kinase families: CK1, GRK, GSK, INSR, PKB, and SRC. The experiments illustrate that (1) KNNCE achieves good results in these families, and (2) the accuracies of the prediction system for these families are 69.25%, 69%, 71.91%, 86.65%, 88.83% and 95.22%, respectively.

I. INTRODUCTION

Protein phosphorylation, as one of the most important post-translational modifications in both prokaryotic and eukaryotic cells, is involved in the regulation of many cellular pathways [1][2], including metabolism, growth, differentiation and membrane transport. Kinases, also known as phosphotransferases, constitute a large protein superfamily and act as the enzymes in protein phosphorylation. In eukaryotic organisms, the most common form of phosphorylation is the introduction of a phosphate group onto a particular serine, threonine or tyrosine residue (the phosphorylation site) of the substrate, catalyzed by a specific kinase.

Phosphorylation sites and the relevant kinases can be identified in vivo and in vitro. Such methods include mass spectrometry (MS) techniques (Aebersold et al., 2003 [3]), peptide microarrays (Rychlewski et al., 2004 [4]), and phosphospecific proteolysis (Knight et al., 2003 [5]). Phospho.ELM (Diella et al., 2004 [6]) is a database of such experimentally verified phosphorylation sites in eukaryotic proteins. However, such methods are usually expensive and time-consuming.
With the fast-growing number of published protein sequences, computational approaches that predict phosphorylation sites more conveniently and efficiently are highly desirable and have developed quickly.

NetPhos (Blom et al., 1999 [7]) is an early prediction system based on a standard feed-forward artificial neural network; it was extended by Blom et al. 2004 [8] to NetPhosK, a kinase-specific prediction system. Scansite (with the latest version 2.0) is a search tool for motifs that are likely to be phosphorylated by specific kinases (Yaffe et al. 2001 [9]). It is based on a matrix of the selectivity values of residues at each position relative to the experimentally identified phosphorylation sites. Kim et al. 2004 [10] designed PredPhospho, also a kinase-specific prediction system, and adopted the support vector machine (SVM) as the core algorithm. Xue et al. 2005 [11] proposed a group-based phosphorylation predicting and scoring (GPS) method, which calculates the similarity of motifs based on the BLOSUM62 matrix. Xue et al. 2006 [12] also applied Bayesian decision theory in an approach called PPSP (Prediction of PK-specific Phosphorylation site). A hidden Markov model was adopted by Huang et al. 2005 [13] in their web server KinasePhos 1.0 for computationally identifying catalytic kinase-specific phosphorylation sites.
Wong et al. 2007 [14] extended KinasePhos 1.0 to KinasePhos 2.0, which integrates SVM, protein sequence profiles and protein coupling patterns.

Although a number of approaches exist for the prediction of phosphorylation sites, none of them considers the classifier ensemble approach, which combines multiple classifiers to obtain more robust, stable and accurate results. In this paper, we propose a new classifier ensemble approach called K-Nearest Neighbor Classifier Ensemble (KNNCE), which incorporates the bagging technique and the K-nearest neighbor classifier into the ensemble framework to predict eukaryotic protein phosphorylation sites and to improve the accuracy, stability and robustness of the final predicted result.

II. K-N
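
As an illustration of the ensemble idea introduced above (bagging combined with K-nearest neighbor base classifiers), the following is a minimal Python sketch, not the authors' implementation. The Hamming distance on raw peptide windows, the window length, the values of k and n_bags, and the toy sequences are all assumptions made for illustration; this excerpt does not specify the feature encoding or the parameters used by KNNCE.

```python
import random
from collections import Counter


def hamming(a, b):
    """Number of positions at which two equal-length peptide windows differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k):
    """Majority label among the k training windows closest to the query."""
    nearest = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def bagged_knn_predict(train, query, k=3, n_bags=11, seed=0):
    """Run KNN on n_bags bootstrap resamples and majority-vote the results."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_bags):
        # Bootstrap sample: draw len(train) items with replacement.
        bag = [rng.choice(train) for _ in train]
        votes[knn_predict(bag, query, k)] += 1
    return votes.most_common(1)[0][0]


# Made-up 7-residue windows centred on S/T (hypothetical data, not from
# Phospho.ELM); label 1 = phosphorylation site, 0 = non-site.
TOY_TRAIN = [
    ("RRRSLSA", 1), ("RKRSVSE", 1), ("RRASLGA", 1), ("KRRSFSD", 1),
    ("GLVSPQE", 0), ("AEDTPLL", 0), ("PGGSPEE", 0), ("LLQTDEV", 0),
]

if __name__ == "__main__":
    print(bagged_knn_predict(TOY_TRAIN, "RRRSISA"))
```

Each base classifier sees a different bootstrap resample of the training windows, so a single unusual neighbor has less influence on the final vote than it would in one KNN classifier; this is the stability argument usually made for bagging. In this sketch one such ensemble would be built per kinase family.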
[1] R. Aebersold, et al. Mass spectrometry-based proteomics. Nature, 2003.
[2] N. Blom, et al. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics, 2004.
[3] Bermseok Oh, et al. Prediction of phosphorylation sites using SVMs. Bioinform., 2004.
[4] Leszek Rychlewski, et al. Target specificity analysis of the Abl kinase using peptide microarray data. Journal of Molecular Biology, 2004.
[5] Nikolaj Blom, et al. Phospho.ELM: A database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinformatics, 2004.
[6] M. Yaffe, et al. A motif-based profile scanning approach for genome-wide prediction of signaling pathways. Nature Biotechnology, 2001.
[7] Yu Xue, et al. GPS: a comprehensive www server for phosphorylation sites prediction. Nucleic Acids Res., 2005.
[8] Birgit Schilling, et al. Phosphospecific proteolysis for mapping sites of protein phosphorylation. Nature Biotechnology, 2003.
[9] Leo Breiman, et al. Bagging Predictors. Machine Learning, 1996.
[10] Yu Xue, et al. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics, 2006.
[11] Eugene I Shakhnovich, et al. Amino acids determining enzyme-substrate specificity in prokaryotic and eukaryotic protein kinases. Proceedings of the National Academy of Sciences of the United States of America, 2003.
[12] Jorng-Tzong Horng, et al. KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites. Nucleic Acids Res., 2005.
[13] N. Blom, et al. Sequence and structure-based prediction of eukaryotic protein phosphorylation sites. Journal of Molecular Biology, 1999.
[14] Václav Hlavác, et al. Ten Lectures on Statistical and Structural Pattern Recognition. Computational Imaging and Vision, 2002.
[15] Hsien-Da Huang, et al. KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns. Nucleic Acids Res., 2007.