PhosphoPredict: A bioinformatics tool for prediction of human kinase-specific phosphorylation substrates and sites by integrating heterogeneous feature selection

Protein phosphorylation is a major form of post-translational modification (PTM) that regulates diverse cellular processes. In silico methods for phosphorylation site prediction can provide a useful and complementary strategy for complete phosphoproteome annotation. Here, we present a novel bioinformatics tool, PhosphoPredict, that combines protein sequence and functional features to predict kinase-specific substrates and their associated phosphorylation sites for 12 human kinases and kinase families, including ATM, CDKs, GSK-3, MAPKs, PKA, PKB, PKC, and SRC. To elucidate critical determinants, we identified feature subsets that were most informative and relevant for predicting substrate specificity for each individual kinase family. Extensive benchmarking experiments based on both five-fold cross-validation and independent tests indicated that the performance of PhosphoPredict is competitive with that of several other popular prediction tools, including KinasePhos, PPSP, GPS, and Musite. We found that combining protein functional and sequence features significantly improves phosphorylation site prediction performance across all kinases. Application of PhosphoPredict to the entire human proteome identified 150 to 800 potential phosphorylation substrates for each of the 12 kinases or kinase families. PhosphoPredict significantly extends the bioinformatics portfolio for kinase function analysis and will facilitate high-throughput identification of kinase-specific phosphorylation sites, thereby contributing to both basic and translational research programs.
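
The evaluation setup mentioned above (kinase-specific site prediction assessed by five-fold cross-validation) can be illustrated with a minimal sketch. This is not the PhosphoPredict implementation: the function names `candidate_sites` and `five_fold_splits`, the choice of S/T/Y as candidate residues, the flanking window of 7 residues, and the `-` padding character are all assumptions made for illustration only.

```python
import random


def candidate_sites(sequence, residues="STY", flank=7, pad="-"):
    """List candidate phosphorylation sites in a protein sequence.

    Returns (1-based position, residue, window) tuples, where window is the
    residue plus `flank` neighbours on each side, padded at the termini.
    The residue set, flank size and pad symbol are illustrative assumptions.
    """
    padded = pad * flank + sequence + pad * flank
    sites = []
    for i, aa in enumerate(sequence):
        if aa in residues:
            # Position i in `sequence` sits at i + flank in `padded`,
            # so the centred window starts at index i.
            window = padded[i:i + 2 * flank + 1]
            sites.append((i + 1, aa, window))
    return sites


def five_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once with a fixed seed, dealt into k folds, and each
    fold is held out in turn while the remaining folds form the training set.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = sorted(folds[held_out])
        train = sorted(
            idx for j, fold in enumerate(folds) if j != held_out for idx in fold
        )
        yield train, test


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real substrate.
    sites = candidate_sites("MKSPTAYLRSG", flank=3)
    for train, test in five_fold_splits(len(sites)):
        print("train:", train, "test:", test)
```

In a real benchmark, each window would be encoded as sequence and functional features and passed to a classifier, and the independent test set would be kept entirely separate from the five folds.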
