Recent Development of Machine Learning Methods in Microbial Phosphorylation Sites

A variety of protein post-translational modifications have been identified that control many cellular functions. Phosphorylation in mycobacterial organisms has been shown to be critically important in diverse biological processes, such as intercellular communication and cell division. Recent technical advances in high-precision mass spectrometry have identified a large number of microbial phosphorylated proteins and phosphorylation sites through proteome-wide analysis. However, experimental identification of phosphorylated proteins and their specific modified residues is often labor-intensive, costly, and time-consuming. Machine learning (ML) approaches can help overcome these limitations, yet only a limited number of computational phosphorylation site prediction tools have been developed so far. This work presents a comprehensive survey of the existing ML predictors for microbial phosphorylation. We cover several aspects that are important for developing a successful predictor, including the ML algorithms used, feature selection methods, window size, and software utility. We first review the currently available microbial phosphorylation site databases, the state-of-the-art ML approaches, their working principles, and their performance. Lastly, we discuss the limitations and future directions of computational ML methods for phosphorylation site prediction.
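To make the notion of window size concrete, the following minimal sketch shows how sequence-based predictors commonly frame the task: each candidate serine or threonine is represented by a fixed-length peptide window centred on that residue, and the windows are then encoded as features for an ML classifier. The function name `extract_windows`, the padding character `X`, the default half-width, and the toy sequence are illustrative assumptions and are not taken from any of the surveyed tools.

```python
def extract_windows(sequence, half_width=3, targets="ST", pad="X"):
    """Return (1-based position, window) pairs centred on each target residue.

    Hypothetical helper for illustration: the window length is
    2 * half_width + 1, and sequence ends are padded with `pad`
    so that every window has the same length.
    """
    padded = pad * half_width + sequence + pad * half_width
    windows = []
    for i, residue in enumerate(sequence):
        if residue in targets:
            # Residue i of the original sequence sits at index i + half_width
            # in the padded string, so the window starts at index i.
            windows.append((i + 1, padded[i : i + 2 * half_width + 1]))
    return windows


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    for position, window in extract_windows("MSKTA", half_width=2):
        print(position, window)
```

A larger half-width gives the classifier more sequence context around the site but also enlarges the feature space, which is one reason window size is treated as a tunable design choice.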
