Phosphorylation sites prediction using Random Forest

Protein phosphorylation is one of the most widespread regulatory mechanisms in eukaryotes. Over the past decade, phosphorylation site prediction has emerged as an important problem in bioinformatics. Here, we report a new method, termed Random Forest-based Phosphosite predictor 1.0 (RF-Phos 1.0), that predicts phosphorylation sites given only the primary amino acid sequence of a protein as input. RF-Phos 1.0 uses random forest classifiers to integrate various sequence and structural features, and it is able to identify putative phosphorylation sites across many protein families. In side-by-side comparisons based on 10-fold cross-validation and an independent dataset, RF-Phos 1.0 compares favorably to existing phosphosite prediction methods such as PhosphoSVM, GPS 2.1, and Musite.
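As a rough illustration of what "primary sequence as input" can mean in practice, the sketch below extracts a fixed-length window around every serine, threonine, or tyrosine in a sequence and turns each window into a simple amino acid composition vector. This is not the RF-Phos 1.0 feature set: the function names, the flank of 4 residues, the `-` padding character, and the choice of composition features are all assumptions made for the example. Vectors of this kind are the sort of per-site input a random forest classifier could be trained on.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard residues
PHOSPHO_RESIDUES = "STY"               # candidate phosphorylation targets


def candidate_windows(sequence, flank=4, pad="-"):
    """Yield (1-based position, residue, window) for every S, T or Y.

    The flank size and the padding character are illustrative assumptions.
    """
    padded = pad * flank + sequence + pad * flank
    for i, residue in enumerate(sequence):
        if residue in PHOSPHO_RESIDUES:
            # Residue i of the sequence sits at index i + flank of the padded
            # string, so this slice is centered on the candidate site.
            yield i + 1, residue, padded[i : i + 2 * flank + 1]


def composition_features(window):
    """Return the fraction of each standard amino acid in the window.

    Padding characters count toward the window length but get no feature.
    """
    counts = Counter(window)
    return [counts[aa] / len(window) for aa in AMINO_ACIDS]


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from any dataset.
    for position, residue, window in candidate_windows("MKRSPTAYG"):
        print(position, residue, window, composition_features(window))
```

Each candidate site yields one 20-dimensional vector; labeled vectors from known phosphosites and non-sites could then be split into folds for cross-validated training of a classifier.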
