Computational identification of ubiquitylation sites from protein sequences

Background: Ubiquitylation plays an important role in regulating protein functions. Recently, experimental methods were developed for the effective identification of ubiquitylation sites. To efficiently explore undiscovered ubiquitylation sites, this study aims to develop an accurate sequence-based prediction method to identify promising ubiquitylation sites.

Results: We established a ubiquitylation dataset consisting of 157 ubiquitylation sites and 3676 putative non-ubiquitylation sites extracted from 105 proteins in the UbiProt database. This study first evaluates sequence-based features and classifiers for the prediction of ubiquitylation sites by assessing three kinds of features (amino acid identity, evolutionary information, and physicochemical properties) and three classifiers (support vector machine, k-nearest neighbor, and Naïve Bayes). The results show that the set of 531 physicochemical properties and the support vector machine (SVM) are the best kind of feature and the best classifier, respectively; their combination has a prediction accuracy of 72.19% using leave-one-out cross-validation. Consequently, an informative physicochemical property mining algorithm (IPMA) is proposed to select an informative subset of the 531 physicochemical properties. A prediction system, UbiPred, was implemented using an SVM with the feature set of 31 informative physicochemical properties selected by IPMA, which improves the accuracy from 72.19% to 84.44%. To further analyze the informative physicochemical properties, the decision tree method C5.0 was used to acquire if-then rule-based knowledge for predicting ubiquitylation sites. UbiPred can screen promising ubiquitylation sites from putative non-ubiquitylation sites using prediction scores. By applying UbiPred, 23 promising ubiquitylation sites were identified from an independent dataset of 3424 putative non-ubiquitylation sites, and these sites were also validated using the obtained prediction rules.

Conclusion: We have proposed an algorithm, IPMA, for mining informative physicochemical properties from protein sequences to build an SVM-based prediction system, UbiPred. UbiPred predicts ubiquitylation sites and assigns each a prediction score to help biologists identify promising sites for experimental verification. UbiPred has been implemented as a web server and is available at http://iclab.life.nctu.edu.tw/ubipred.
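The evaluation described in the abstract (encode a sequence window around each candidate lysine with physicochemical properties, then score a classifier by leave-one-out cross-validation) can be illustrated with a minimal sketch. Several details here are assumptions, not taken from the paper: the window half-width `flank`, the use of a window-averaged value as the feature, the Kyte-Doolittle hydropathy scale as a stand-in for the 531 physicochemical properties, and a 1-nearest-neighbor classifier in place of the SVM so that the example needs only the standard library. The function names are hypothetical.

```python
from statistics import mean

# Kyte-Doolittle hydropathy scale, used as ONE stand-in physicochemical
# property (assumption: the paper draws on 531 properties instead).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def lysine_windows(sequence, flank=3):
    """Yield (1-based position, window) for every lysine (K).

    Windows that run past either end of the sequence are padded with 'X'.
    The half-width `flank` is an illustrative assumption.
    """
    padded = "X" * flank + sequence + "X" * flank
    for i, residue in enumerate(sequence):
        if residue == "K":
            yield i + 1, padded[i:i + 2 * flank + 1]


def encode(window, scale=HYDROPATHY):
    """Encode a window as a one-element feature vector: the mean property
    value over residues present in the scale (padding 'X' is skipped)."""
    values = [scale[r] for r in window if r in scale]
    return [mean(values)] if values else [0.0]


def squared_distance(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def loo_accuracy(features, labels):
    """Leave-one-out cross-validation with a 1-nearest-neighbor classifier.

    Each sample is held out in turn and predicted from the label of its
    nearest neighbor among all remaining samples.
    """
    n = len(features)
    correct = 0
    for i in range(n):
        others = (k for k in range(n) if k != i)
        nearest = min(others, key=lambda k: squared_distance(features[i], features[k]))
        correct += labels[nearest] == labels[i]
    return correct / n


if __name__ == "__main__":
    # Hypothetical toy sequence; not taken from UbiProt.
    toy_sequence = "MKAILVKDDERKLLIV"
    for position, window in lysine_windows(toy_sequence):
        print(position, window, encode(window))
```

Averaging each property over the window gives one feature per property, so using many properties would simply lengthen the vector returned by `encode`; the leave-one-out loop is unchanged by the choice of classifier or feature count.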
