UbiSite: incorporating two-layered machine learning method with substrate motifs to predict ubiquitin-conjugation site on lysines

BackgroundThe conjugation of ubiquitin to a substrate protein (protein ubiquitylation), which involves a sequential process – E1 activation, E2 conjugation and E3 ligation, is crucial to the regulation of protein function and activity in eukaryotes. This ubiquitin-conjugation process typically binds the last amino acid of ubiquitin (glycine 76) to a lysine residue of a target protein. The high-throughput of mass spectrometry-based proteomics has stimulated a large-scale identification of ubiquitin-conjugated peptides. Hence, a new web resource, UbiSite, was developed to identify ubiquitin-conjugation site on lysines based on large-scale proteome dataset.ResultsGiven a total of 37,647 ubiquitin-conjugated proteins, including 128026 ubiquitylated peptides, obtained from various resources, this study carries out a large-scale investigation on ubiquitin-conjugation sites based on sequenced and structural characteristics. A TwoSampleLogo reveals that a significant depletion of histidine (H), arginine (R) and cysteine (C) residues around ubiquitylation sites may impact the conjugation of ubiquitins in closed three-dimensional environments. Based on the large-scale ubiquitylation dataset, a motif discovery tool, MDDLogo, has been adopted to characterize the potential substrate motifs for ubiquitin conjugation. Not only are single features such as amino acid composition (AAC), positional weighted matrix (PWM), position-specific scoring matrix (PSSM) and solvent-accessible surface area (SASA) considered, but also the effectiveness of incorporating MDDLogo-identified substrate motifs into a two-layered prediction model is taken into account. Evaluation by five-fold cross-validation showed that PSSM is the best feature in discriminating between ubiquitylation and non-ubiquitylation sites, based on support vector machine (SVM). Additionally, the two-layered SVM model integrating MDDLogo-identified substrate motifs could obtain a promising accuracy and the Matthews Correlation Coefficient (MCC) at 81.06 % and 0.586, respectively. Furthermore, the independent testing showed that the two-layered SVM model could outperform other prediction tools, reaching at 85.10 % sensitivity, 69.69 % specificity, 73.69 % accuracy and the 0.483 of MCC value.ConclusionThe independent testing result indicated the effectiveness of incorporating MDDLogo-identified motifs into the prediction of ubiquitylation sites. In order to provide meaningful assistance to researchers interested in large-scale ubiquitinome data, the two-layered SVM model has been implemented onto a web-based system (UbiSite), which is freely available at http://csb.cse.yzu.edu.tw/UbiSite/. Two cases given in the UbiSite provide a demonstration of effective identification of ubiquitylation sites with reference to substrate motifs.

[1]  Tzong-Yi Lee,et al.  PlantPhos: using maximal dependence decomposition to identify plant phosphorylation sites with substrate site specificity , 2011, BMC Bioinformatics.

[2]  Jorng-Tzong Horng,et al.  KinasePhos: a web tool for identifying protein kinase-specific phosphorylation sites , 2005, Nucleic Acids Res..

[3]  Jiangning Song,et al.  Towards more accurate prediction of ubiquitination sites: a comprehensive review of current methods, tools and features , 2015, Briefings Bioinform..

[4]  D T Jones,et al.  Protein secondary structure prediction based on position-specific scoring matrices. , 1999, Journal of molecular biology.

[5]  Yu-Ju Chen,et al.  dbSNO 2.0: a resource for exploring structural environment, functional and disease association and regulatory network of protein S-nitrosylation , 2014, Nucleic Acids Res..

[6]  C. Pickart,et al.  Ubiquitin: structures, functions, mechanisms. , 2004, Biochimica et biophysica acta.

[7]  M. Gromiha,et al.  Real value prediction of solvent accessibility from amino acid sequence , 2003, Proteins.

[8]  B. Matthews Comparison of the predicted and observed secondary structure of T4 phage lysozyme. , 1975, Biochimica et biophysica acta.

[9]  Edward L. Huttlin,et al.  Systematic and quantitative assessment of the ubiquitin-modified proteome. , 2011, Molecular cell.

[10]  G Goldstein,et al.  Isolation of a polypeptide that has lymphocyte-differentiating properties and is probably represented universally in living cells. , 1975, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Jiangning Song,et al.  hCKSAAP_UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. , 2013, Biochimica et biophysica acta.

[12]  Tzong-Yi Lee,et al.  Carboxylator: incorporating solvent-accessible surface area for identifying protein carboxylation sites , 2011, J. Comput. Aided Mol. Des..

[13]  M. Wilkins,et al.  Surface accessibility of protein post-translational modifications. , 2007, Journal of proteome research.

[14]  Keiichi I Nakayama,et al.  Proteome-wide identification of ubiquitylation sites by conjugation of engineered lysine-less ubiquitin. , 2012, Journal of proteome research.

[15]  Vladimir Vacic,et al.  Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments , 2006, Bioinform..

[16]  K. Robert Lai,et al.  Characterization and identification of ubiquitin conjugation sites with E3 ligase recognition specificities , 2015, BMC Bioinformatics.

[17]  Tao Huang,et al.  Using WPNNA classifier in ubiquitination site prediction based on hybrid features. , 2013, Protein and peptide letters.

[18]  Tzong-Yi Lee,et al.  Incorporating Distant Sequence Features and Radial Basis Function Networks to Identify Ubiquitin Conjugation Sites , 2011, PloS one.

[19]  Yu Xue,et al.  UUCD: a family-based database of ubiquitin and ubiquitin-like conjugation , 2012, Nucleic Acids Res..

[20]  A. Weissman,et al.  HECT and RING finger families of E3 ubiquitin ligases at a glance , 2012, Journal of Cell Science.

[21]  Xiang-tao Li,et al.  Prediction of Lysine Ubiquitylation with Ensemble Classifier and Feature Selection , 2011, International journal of molecular sciences.

[22]  Xiang Chen,et al.  Incorporating key position and amino acid residue features to identify general and species-specific Ubiquitin conjugation sites , 2013, Bioinform..

[23]  Shandar Ahmad,et al.  RVP-net: online prediction of real valued accessible surface area of proteins from single sequences , 2003, Bioinform..

[24]  Shinn-Ying Ho,et al.  Computational identification of ubiquitylation sites from protein sequences , 2008, BMC Bioinformatics.

[25]  Tzong-Yi Lee,et al.  Identifying Protein Phosphorylation Sites with Kinase Substrate Specificity on Human Viruses , 2012, PloS one.

[26]  R. Mayer,et al.  Ubiquitin and ubiquitin-like proteins as multifunctional signals , 2005, Nature Reviews Molecular Cell Biology.

[27]  Tzong-Yi Lee,et al.  An Intelligent System for Identifying Acetylated Lysine on Histones and Nonhistone Proteins , 2014, BioMed research international.

[28]  V. Vacic,et al.  Identification, analysis, and prediction of protein ubiquitination sites , 2010, Proteins.

[29]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[30]  Liam J. McGuffin,et al.  Protein structure prediction servers at University College London , 2005, Nucleic Acids Res..

[31]  Tzong-Yi Lee,et al.  GSHSite: Exploiting an Iteratively Statistical Method to Identify S-Glutathionylation Sites with Substrate Specificity , 2015, PloS one.

[32]  Tao Zhou,et al.  mUbiSiDa: A Comprehensive Database for Protein Ubiquitination Sites in Mammals , 2014, PloS one.

[33]  Jorng-Tzong Horng,et al.  Incorporating support vector machine for identifying protein tyrosine sulfation sites , 2009, J. Comput. Chem..

[34]  Tzong-Yi Lee,et al.  Exploiting maximal dependence decomposition to identify conserved motifs from a group of aligned signal sequences , 2011, Bioinform..

[35]  Tzong-Yi Lee,et al.  ViralPhos: incorporating a recursively statistical method to predict phosphorylation sites on virus proteins , 2013, BMC Bioinformatics.

[36]  Hsien-Da Huang,et al.  SNOSite: Exploiting Maximal Dependence Decomposition to Identify Cysteine S-Nitrosylation with Substrate Site Specificity , 2011, PloS one.

[37]  P. Howley,et al.  Structure of an E6AP-UbcH7 complex: insights into ubiquitination by the E2-E3 enzyme cascade. , 1999, Science.

[38]  Hsien-Da Huang,et al.  KinasePhos 2.0: a web server for identifying protein kinase-specific phosphorylation sites based on sequences and coupling patterns , 2007, Nucleic Acids Res..

[39]  Hsien-Da Huang,et al.  dbPTM: an information repository of protein post-translational modification , 2005, Nucleic Acids Res..

[40]  Jorng-Tzong Horng,et al.  Incorporating hidden Markov models for identifying protein kinase‐specific phosphorylation sites , 2005, J. Comput. Chem..

[41]  T. N. Bhat,et al.  The Protein Data Bank , 2000, Nucleic Acids Res..

[42]  Tao Huang,et al.  Prediction of lysine ubiquitination with mRMR feature selection and analysis , 2011, Amino Acids.

[43]  D. Y. Lin,et al.  Crystal structures of two bacterial HECT-like E3 ligases in complex with a human E2 reveal atomic details of pathogen-host interactions , 2012, Proceedings of the National Academy of Sciences.

[44]  G. Crooks,et al.  WebLogo: a sequence logo generator. , 2004, Genome research.

[45]  Ying Gao,et al.  Bioinformatics Applications Note Sequence Analysis Cd-hit Suite: a Web Server for Clustering and Comparing Biological Sequences , 2022 .

[46]  Sebastian A. Wagner,et al.  A Proteome-wide, Quantitative Survey of In Vivo Ubiquitylation Sites Reveals Widespread Regulatory Roles* , 2011, Molecular & Cellular Proteomics.

[47]  Ao Li,et al.  LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST , 2005, Nucleic Acids Res..

[48]  George M. Church,et al.  Preferred in vivo ubiquitination sites , 2004, Bioinform..

[49]  D. Rotin,et al.  Physiological functions of the HECT family of ubiquitin ligases , 2009, Nature Reviews Molecular Cell Biology.

[50]  Linda Hicke,et al.  Ubiquitin-binding domains , 2005, Nature Reviews Molecular Cell Biology.

[51]  Hsien-Da Huang,et al.  dbSNO: a database of cysteine S-nitrosylation , 2012, Bioinform..

[52]  Yu-Ju Chen,et al.  Characterization and identification of protein O-GlcNAcylation sites with substrate specificity , 2014, BMC Bioinformatics.

[53]  Yu Xue,et al.  CPLM: a database of protein lysine modifications , 2013, Nucleic Acids Res..

[54]  Hsien-Da Huang,et al.  Incorporating Evolutionary Information and Functional Domains for Identifying RNA Splicing Factors in Humans , 2011, PloS one.

[55]  Hsien-Da Huang,et al.  dbPTM 3.0: an informative resource for investigating substrate site specificity and functional association of protein post-translational modifications , 2012, Nucleic Acids Res..

[56]  Keith D Wilkinson,et al.  The discovery of ubiquitin-dependent proteolysis , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[57]  Steven P Gygi,et al.  A proteomics approach to understanding protein ubiquitination , 2003, Nature Biotechnology.

[58]  C. Brenner,et al.  Yeast Chfr homologs retard cell cycle at G1 and G2/M via Ubc4 and Ubc13/Mms2-dependent ubiquitination , 2008, Cell cycle.

[59]  Yong-Zi Chen,et al.  Prediction of Ubiquitination Sites by Using the Composition of k-Spaced Amino Acid Pairs , 2011, PloS one.