Detecting Succinylation sites from protein sequences using ensemble support vector machine

BackgroundLysine succinylation is a new kind of post-translational modification which plays a key role in protein conformation regulation and cellular function control. To understand the mechanism of succinylation profoundly, it is necessary to identify succinylation sites in proteins accurately. However, traditional methods, experimental approaches, are labor-intensive and time-consuming. Computational prediction methods have been proposed recent years, and they are popular because of their convenience and high speed. In this study, we developed a new method to predict succinylation sites in protein combining multiple features, including amino acid composition, binary encoding, physicochemical property and grey pseudo amino acid composition, with a feature selection scheme (information gain). And then, it was trained using SVM (Support Vector Machine) and an ensemble learning algorithm.ResultsThe performance of this method was measured with an accuracy of 89.14% and a MCC (Matthew Correlation Coefficient) of 0.79 using 10-fold cross validation on training dataset and an accuracy of 84.5% and a MCC of 0.2 on independent dataset.ConclusionsThe conclusions made from this study can help to understand more of the succinylation mechanism. These results suggest that our method was very promising for predicting succinylation sites. The source code and data of this paper are freely available athttps://github.com/ningq669/PSuccE.

[1]  B. Efron Bootstrap Methods: Another Look at the Jackknife , 1979 .

[2]  J. Deng,et al.  Introduction to Grey system theory , 1989 .

[3]  R. Tibshirani,et al.  Monographs on statistics and applied probability , 1990 .

[4]  K. Chou,et al.  Prediction of protein structural classes. , 1995, Critical reviews in biochemistry and molecular biology.

[5]  C E Shannon,et al.  The mathematical theory of communication. 1963. , 1997, M.D. computing : computers in medical practice.

[6]  Hiroyuki Ogata,et al.  AAindex: Amino Acid Index Database , 1999, Nucleic Acids Res..

[7]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001 .

[8]  K. Chou Prediction of protein cellular attributes using pseudo‐amino acid composition , 2001, Proteins.

[9]  A. Holmgren,et al.  Identification of S-glutathionylated cellular proteins during oxidative stress and constitutive metabolism by affinity purification and proteomic analysis. , 2002, Archives of biochemistry and biophysics.

[10]  N. Shimizu,et al.  Common anti-apoptotic roles of parkin and α-synuclein in human dopaminergic cells , 2005 .

[11]  N. Shimizu,et al.  Corrigendum to “Common anti-apoptotic roles of parkin and α-synuclein in human dopaminergic cells” [Biochem. Biophys. Res. Commun. 332 (2005) 233–240] , 2005 .

[12]  Kuo-Chen Chou,et al.  Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes , 2005, Bioinform..

[13]  Vladimir Vacic,et al.  Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments , 2006, Bioinform..

[14]  Rietz Lecture Bootstrap Methods : Another Look at the Jackknife B , 2007 .

[15]  Susan A. Murphy,et al.  Monographs on statistics and applied probability , 1990 .

[16]  K. Chou,et al.  Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms , 2008, Nature Protocols.

[17]  S. Berger,et al.  The emerging field of dynamic lysine methylation of non-histone proteins. , 2008, Current opinion in genetics & development.

[18]  K. Chou Pseudo Amino Acid Composition and its Applications in Bioinformatics, Proteomics and System Biology , 2009 .

[19]  V. Vacic,et al.  Identification, analysis, and prediction of protein ubiquitination sites , 2010, Proteins.

[20]  Xiang-tao Li,et al.  Prediction of Lysine Ubiquitylation with Ensemble Classifier and Feature Selection , 2011, International journal of molecular sciences.

[21]  Yu-Dong Cai,et al.  Prediction and analysis of protein methylarginine and methyllysine based on Multisequence features. , 2011, Biopolymers.

[22]  Tzong-Yi Lee,et al.  Incorporating Distant Sequence Features and Radial Basis Function Networks to Identify Ubiquitin Conjugation Sites , 2011, PloS one.

[23]  K. Chou,et al.  iDNA-Prot: Identification of DNA Binding Proteins Using Random Forest with Grey Model , 2011, PloS one.

[24]  Zhihong Zhang,et al.  Identification of lysine succinylation as a new post-translational modification. , 2011, Nature chemical biology.

[25]  K. Chou Some remarks on protein attribute prediction and pseudo amino acid composition , 2010, Journal of Theoretical Biology.

[26]  J. Boeke,et al.  Lysine Succinylation and Lysine Malonylation in Histones* , 2012, Molecular & Cellular Proteomics.

[27]  K. Chou,et al.  iLoc-Hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites. , 2012, Molecular bioSystems.

[28]  Shu-Yun Huang,et al.  Position-Specific Analysis and Prediction for Protein Lysine Acetylation Based on Multiple Features , 2012, PLoS ONE.

[29]  K. Chou,et al.  Predicting Secretory Proteins of Malaria Parasite by Incorporating Sequence Evolution Information into Pseudo Amino Acid Composition via Grey System Model , 2012, PloS one.

[30]  James G. Lyons,et al.  A feature extraction technique using bi-gram probabilities of position specific scoring matrix for protein fold recognition. , 2013, Journal of theoretical biology.

[31]  Yingming Zhao,et al.  SIRT5-mediated lysine desuccinylation impacts diverse metabolic pathways. , 2013, Molecular cell.

[32]  Kuo-Chen Chou,et al.  Some remarks on predicting multi-label attributes in molecular biosystems. , 2013, Molecular bioSystems.

[33]  K. Chou,et al.  iLoc-Animal: a multi-label learning classifier for predicting subcellular localization of animal proteins. , 2013, Molecular bioSystems.

[34]  Sebastian A. Wagner,et al.  Lysine succinylation is a frequently occurring modification in prokaryotes and eukaryotes and extensively overlaps with acetylation. , 2013, Cell reports.

[35]  Jacques Lapointe,et al.  Theoretical and experimental biology in one—A symposium in honour of Professor Kuo-Chen Chou’s 50th anniversary and Professor Richard Giegé’s 40th anniversary of their scientific careers , 2013 .

[36]  K. Chou,et al.  iSNO-PseAAC: Predict Cysteine S-Nitrosylation Sites in Proteins by Incorporating Position Specific Amino Acid Propensity into Pseudo Amino Acid Composition , 2013, PloS one.

[37]  Dong-Sheng Cao,et al.  propy: a tool to generate various modes of Chou's PseAAC , 2013, Bioinform..

[38]  K. Chou,et al.  iGPCR-Drug: A Web Server for Predicting Interaction between GPCRs and Drugs in Cellular Networking , 2013, PloS one.

[39]  K. Chou,et al.  iSNO-AAPair: incorporating amino acid pairwise coupling into PseAAC for predicting cysteine S-nitrosylation sites in proteins , 2013, PeerJ.

[40]  Yu Xue,et al.  CPLM: a database of protein lysine modifications , 2013, Nucleic Acids Res..

[41]  Cangzhi Jia,et al.  Prediction of Protein S-Nitrosylation Sites Based on Adapted Normal Distribution Bi-Profile Bayes and Chou’s Pseudo Amino Acid Composition , 2014, International journal of molecular sciences.

[42]  K. Chou,et al.  iMethyl-PseAAC: Identification of Protein Methylation Sites via a Pseudo Amino Acid Composition Approach , 2014, BioMed research international.

[43]  K. Chou,et al.  iNitro-Tyr: Prediction of Nitrotyrosine Sites in Proteins with General Pseudo Amino Acid Composition , 2014, PloS one.

[44]  Yingming Zhao,et al.  Lysine glutarylation is a protein posttranslational modification regulated by SIRT5. , 2014, Cell metabolism.

[45]  K. Chou,et al.  iHyd-PseAAC: Predicting Hydroxyproline and Hydroxylysine in Proteins by Incorporating Dipeptide Position-Specific Propensity into Pseudo Amino Acid Composition , 2014, International journal of molecular sciences.

[46]  Zhiqiang Ma,et al.  PSNO: Predicting Cysteine S-Nitrosylation Sites by Incorporating Various Sequence-Derived Features into the General Form of Chou’s PseAAC , 2014, International journal of molecular sciences.

[47]  S. Liang,et al.  Systematic identification of the lysine lactylation in the protozoan parasite Toxoplasma gondii , 2014, Parasites & Vectors.

[48]  B. O’Rourke,et al.  Metabolism leaves its mark on the powerhouse: recent progress in post-translational modifications of lysine in mitochondria , 2014, Front. Physiol..

[49]  W. Zhong,et al.  Molecular Science for Drug Development and Biomedicine , 2014, International journal of molecular sciences.

[50]  Pufeng Du,et al.  PseAAC-General: Fast Building Various Modes of General Form of Chou’s Pseudo-Amino Acid Composition for Large-Scale Protein Datasets , 2014, International journal of molecular sciences.

[51]  Maqsood Hayat,et al.  iRSpot-GAEnsC: identifing recombination spots via ensemble classifier and extending the concept of Chou’s PseAAC to formulate DNA samples , 2015, Molecular Genetics and Genomics.

[52]  Junjie Chen,et al.  Pse-in-One: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences , 2015, Nucleic Acids Res..

[53]  Minghao Yin,et al.  PGluS: prediction of protein S-glutathionylation sites with multiple features and analysis. , 2015, Molecular bioSystems.

[54]  K. Chou,et al.  iUbiq-Lys: prediction of lysine ubiquitination sites in proteins by extracting sequence evolution information via a gray system model , 2015, Journal of biomolecular structure & dynamics.

[55]  K. Chou Impacts of bioinformatics to medicinal chemistry. , 2015, Medicinal chemistry (Shariqah (United Arab Emirates)).

[56]  Ling-Yun Wu,et al.  iSuc-PseAAC: predicting lysine succinylation in proteins by incorporating peptide position-specific propensity , 2015, Scientific Reports.

[57]  Shao-Ping Shi,et al.  SuccFind: a novel succinylation sites online prediction tool via enhanced characteristic strategy , 2015, Bioinform..

[58]  Zhiqiang Ma,et al.  Accurate in silico identification of protein succinylation sites using an iterative semi-supervised learning technique. , 2015, Journal of theoretical biology.

[59]  Kuo-Chen Chou,et al.  iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC , 2016, Oncotarget.

[60]  Dong Xu,et al.  Imbalanced multi-label learning for identifying antimicrobial peptides and their functional types , 2016, Bioinform..

[61]  H. Mohabatkar,et al.  Analysis and comparison of lignin peroxidases between fungi and bacteria using three different modes of Chou's general pseudo amino acid composition. , 2016, Journal of theoretical biology.

[62]  Kuo-Chen Chou,et al.  iPTM-mLys: identifying multiple lysine PTM sites and their different types , 2016, Bioinform..

[63]  Kuo-Chen Chou,et al.  iPhos-PseEn: Identifying phosphorylation sites in proteins by fusing different pseudo components into an ensemble classifier , 2016, Oncotarget.

[64]  K. Chou,et al.  Recent Progress in Predicting Posttranslational Modification Sites in Proteins. , 2015, Current topics in medicinal chemistry.

[65]  Kuo-Chen Chou,et al.  iSuc-PseOpt: Identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. , 2016, Analytical biochemistry.

[66]  Kuo-Chen Chou,et al.  pSuc-Lys: Predict lysine succinylation sites in proteins with PseAAC and ensemble random forest approach. , 2016, Journal of theoretical biology.

[67]  K. Chou,et al.  iCar-PseCp: identify carbonylation sites in proteins by Monte Carlo sampling and incorporating sequence coupled effects into general PseAAC , 2016, Oncotarget.

[68]  Md. Nurul Haque Mollah,et al.  SuccinSite: a computational tool for the prediction of protein succinylation sites by exploiting the amino acid patterns and properties. , 2016, Molecular bioSystems.

[69]  Kuo-Chen Chou,et al.  pSumo-CD: predicting sumoylation sites in proteins with covariance discriminant algorithm by incorporating sequence-coupled effects into general PseAAC , 2016, Bioinform..

[70]  K. Chou,et al.  iACP: a sequence-based tool for identifying anticancer peptides , 2016, Oncotarget.

[71]  T. Tsunoda,et al.  PSSM-Suc: Accurately predicting succinylation using position specific scoring matrix into bigram for feature extraction. , 2017, Journal of theoretical biology.

[72]  Dong Xu,et al.  iPhos‐PseEvo: Identifying Human Phosphorylated Proteins by Incorporating Evolutionary Information into General PseAAC via Grey System Theory , 2017, Molecular informatics.

[73]  T. Tsunoda,et al.  Success: evolutionary and structural properties of amino acids prove effective for succinylation site prediction , 2018, BMC Genomics.

[74]  S. Khan,et al.  Unb-DPC: Identify mycobacterial membrane protein types by incorporating un-biased dipeptide composition into Chou's general PseAAC. , 2017, Journal of theoretical biology.

[75]  T. Tsunoda,et al.  SucStruct: Prediction of succinylated lysine residues by using structural properties of amino acids. , 2017, Analytical biochemistry.

[76]  M. Bakhtiarizadeh,et al.  OOgenesis_Pred: A sequence-based method for predicting oogenesis proteins by six different modes of Chou's pseudo amino acid composition. , 2017, Journal of theoretical biology.

[77]  Abdollah Dehzangi,et al.  Improving succinylation prediction accuracy by incorporating the secondary structure via helix, strand and coil, and evolutionary information from profile bigrams , 2018, PloS one.