PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction

Phosphorylation is one of the most important types of protein post-translational modification. Accordingly, reliable identification of kinase-mediated phosphorylation has important implications for the functional annotation of phosphorylated substrates and the characterization of cellular signalling pathways. The local sequence context surrounding a potential phosphorylation site is considered to harbour the most relevant information for phosphorylation site prediction models. However, there is currently no condensed vector representation of this contextual information, even though a variety of residue-level features can be constructed from sequence homology profiles, structural information, and physicochemical properties. To address this issue, we present PhosContext2vec, a distributed representation of residue-level sequence contexts for potential phosphorylation sites, and demonstrate its application to both general and kinase-specific phosphorylation site prediction. Benchmarking experiments indicate that PhosContext2vec achieves promising predictive performance compared with several existing methods for phosphorylation site prediction. We envisage that PhosContext2vec, as a new sequence context representation, can be combined with other informative residue-level features to improve classification performance in related bioinformatics tasks that require residue-level feature vector representation and extraction. The PhosContext2vec web server is publicly available at http://phoscontext2vec.erc.monash.edu/.
