Analysis and prediction of human acetylation using a cascade classifier based on support vector machine

Background: Lysine acetylation is a widespread, reversible post-translational modification that plays a crucial role in some biological activities. To better understand its mechanism, acetylation sites in proteins must be identified accurately. Computational methods are popular because they are faster and more convenient than experimental methods. In this study, we propose a new computational method to predict acetylation sites in human proteins. It combines sequence and structural features, including physicochemical property (PCP), position specific score matrix (PSSM), auto covariance (AC), residue composition (RC), secondary structure (SS) and accessible surface area (ASA), which together characterize acetylated lysine sites well. A two-step feature selection was then applied, combining maximum relevance minimum redundancy (mRMR) with incremental feature selection (IFS). Finally, a cascade classifier based on support vector machines (SVM) was trained, which addressed the imbalance between positive and negative samples while still using the information in all negative samples.

Results: On an independent dataset, the method achieved a specificity of 72.19% and a sensitivity of 76.71%, indicating that the cascade SVM classifier outperforms a single SVM classifier.

Conclusions: In addition to analyzing the experimental results, we carried out a systematic and comprehensive analysis of the acetylation data.
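The abstract does not spell out how the cascade is built, so the following is only a sketch of one common scheme for imbalanced data, not the authors' implementation. Each stage is trained on all positives plus an equally sized random draw from the negatives that earlier stages failed to reject, and a site is called positive only if every stage accepts it. The names `CentroidStage`, `train_cascade`, `predict_cascade` and `sensitivity_specificity` are hypothetical, the toy 2-D data is synthetic, and the base learner is a nearest-centroid stand-in used to keep the example dependency-free; in practice an SVM (for example one trained with LIBSVM) would take its place.

```python
import random
from statistics import fmean


def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [fmean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class CentroidStage:
    """Stand-in base learner (NOT an SVM): accepts x if it lies nearer
    to the positive centroid than to the negative centroid."""

    def __init__(self, positives, negatives):
        self.pos_c = centroid(positives)
        self.neg_c = centroid(negatives)

    def predict(self, x):
        return sq_dist(x, self.pos_c) <= sq_dist(x, self.neg_c)


def train_cascade(positives, negatives, max_stages=10, seed=0):
    """Assumed cascade scheme: every stage sees a balanced training set,
    and only negatives it wrongly accepts are passed to the next stage."""
    rng = random.Random(seed)
    remaining = list(negatives)
    stages = []
    while remaining and len(stages) < max_stages:
        k = min(len(positives), len(remaining))
        stage = CentroidStage(positives, rng.sample(remaining, k))
        stages.append(stage)
        survivors = [x for x in remaining if stage.predict(x)]
        if len(survivors) == len(remaining):
            break  # this stage rejected nothing, so stop adding stages
        remaining = survivors
    return stages


def predict_cascade(stages, x):
    """A sample is positive only if all stages accept it."""
    return all(stage.predict(x) for stage in stages)


def sensitivity_specificity(stages, positives, negatives):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(predict_cascade(stages, x) for x in positives)
    tn = sum(not predict_cascade(stages, x) for x in negatives)
    return tp / len(positives), tn / len(negatives)


# Synthetic, well-separated toy data with a 1:10 class imbalance.
data_rng = random.Random(1)
positives = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(20)]
negatives = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(200)]

stages = train_cascade(positives, negatives)
sens, spec = sensitivity_specificity(stages, positives, negatives)
print(len(stages), sens, spec)
```

Because a positive call requires agreement from every stage, adding stages tends to raise specificity at some cost in sensitivity, which is the trade-off the two reported metrics capture. The demo scores the training data itself, so it checks only that the code runs, not how well the scheme generalizes.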
