Accurate in silico identification of protein succinylation sites using an iterative semi-supervised learning technique.

As a widespread type of protein post-translational modification (PTM), succinylation plays an important role in regulating protein conformation, function and physicochemical properties. Compared with labor-intensive and time-consuming experimental approaches, computational prediction of succinylation sites is highly desirable because it is convenient and fast. Numerous computational models have been developed to identify PTM sites using various two-class machine learning algorithms. These methods require both positive and negative samples for training. However, designating negative samples for PTMs is difficult, and an improper choice can dramatically degrade the performance of computational models. In this work, we therefore present the first application of the positive samples only learning (PSoL) algorithm to the succinylation site prediction problem. PSoL is a special class of semi-supervised machine learning that uses positive samples and unlabeled samples to train the model. We also propose a novel computational predictor of succinylation sites, SucPred (succinylation site predictor), which uses multiple feature encoding schemes. SucPred achieved an accuracy of 88.65% using 5-fold cross-validation on the training dataset and an accuracy of 84.40% on the independent testing dataset, which demonstrates that the positive samples only learning algorithm presented here is particularly useful for identifying protein succinylation sites. In addition, the positive samples only learning algorithm can readily be applied to build predictors for other types of PTM sites. A web server for predicting succinylation sites was developed and made freely accessible at http://59.73.198.144:8088/SucPred/.
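
The idea of training from positive and unlabeled samples only can be illustrated with a minimal sketch. This is not the SucPred implementation: the function names, the parameters (n_seed, batch, max_iter) and the nearest-centroid scoring rule are assumptions made for illustration, with the centroid rule standing in for a trained classifier. The loop seeds a negative set with the unlabeled samples farthest from the positives, then repeatedly moves the most confidently negative unlabeled samples into that set.

```python
def centroid(rows):
    """Mean vector of a non-empty list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def score(x, pos_c, neg_c):
    """Positive when x is closer to the positive centroid than to the negative one."""
    return sq_dist(x, neg_c) - sq_dist(x, pos_c)


def pu_expand_negatives(positives, unlabeled, n_seed=5, batch=5, max_iter=10):
    """Iteratively grow a negative set from unlabeled data (illustrative only).

    Returns (neg_idx, pool_idx): indices into `unlabeled` that were assigned
    to the negative set, and indices that remain unlabeled.
    """
    pos_c = centroid(positives)
    # Seed negatives: the unlabeled samples farthest from the positive centroid.
    order = sorted(range(len(unlabeled)),
                   key=lambda i: sq_dist(unlabeled[i], pos_c),
                   reverse=True)
    neg_idx, pool_idx = order[:n_seed], order[n_seed:]
    for _ in range(max_iter):
        neg_c = centroid([unlabeled[i] for i in neg_idx])
        scored = [(score(unlabeled[i], pos_c, neg_c), i) for i in pool_idx]
        # Keep only samples scored as negative, most confident first.
        confident = sorted(s for s in scored if s[0] < 0)[:batch]
        if not confident:
            break  # nothing left that looks negative
        chosen = {i for _, i in confident}
        neg_idx.extend(sorted(chosen))
        pool_idx = [i for i in pool_idx if i not in chosen]
    return neg_idx, pool_idx
```

In a realistic setting the feature vectors would come from encoded sequence windows and the scoring rule would be replaced by a trained classifier; samples left in the pool after the loop are the candidates that still resemble the positives.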
