A New Machine Learning Based Framework to Identify Protein Glycation Sites Using Comprehensive Features and the mRMR Method

Accumulation of the end products of the glycation reaction often leads to diseases such as diabetes, Alzheimer’s disease and atherosclerosis. Identifying glycation sites can help researchers understand the pathogenesis of these diseases and suggest new ideas for treating them. In this paper, we develop a new predictor, Gly-Predict, based on the support vector machine. It encodes peptide chains with four feature extraction methods: binary encoding of the sequence, grey incidence degree, accessible surface area and secondary structure probability. The maximum relevance minimum redundancy (mRMR) feature selection algorithm is then used to select the optimal 170 features for the prediction problem. On the training set, Gly-Predict achieves an accuracy of 84.815%, a sensitivity of 80.156%, a specificity of 88.868%, and a Matthews correlation coefficient (MCC) of 68.411% under k-fold cross-validation (k = 5). To evaluate Gly-Predict objectively, we tested the model on an independent dataset and compared it with a previous predictor. The results indicate that Gly-Predict is superior to existing glycation site predictors.
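As a rough illustration of two ingredients named in the abstract, the sketch below shows a binary (one-hot) encoding of a peptide and the four reported evaluation measures computed from a confusion matrix. It is a minimal sketch under stated assumptions, not the authors' implementation: the residue ordering in `AMINO_ACIDS`, the treatment of non-standard symbols as all-zero vectors, and the function names `binary_encode` and `evaluate` are all hypothetical. The metric formulas are the standard definitions of accuracy, sensitivity, specificity and MCC.

```python
import math

# Assumed alphabet: the 20 standard amino acids in alphabetical one-letter order.
# The ordering actually used by the predictor is not stated in the abstract.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encode(peptide):
    """One-hot encode a peptide: 20 bits per residue.

    Symbols outside the assumed alphabet (for example a padding 'X')
    are encoded as 20 zeros; this convention is an assumption.
    """
    vector = []
    for residue in peptide:
        bits = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            bits[index] = 1
        vector.extend(bits)
    return vector


def evaluate(tp, tn, fp, fn):
    """Return accuracy, sensitivity, specificity and MCC from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # fraction of true glycation sites recovered
    specificity = tn / (tn + fp)   # fraction of non-glycation sites rejected
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Illustrative inputs only; these are not data from the paper.
    features = binary_encode("AKX")
    print(len(features), sum(features))
    print(evaluate(tp=8, tn=9, fp=1, fn=2))
```

In a k-fold setting, such measures are usually computed on each held-out fold and then averaged, or computed once from confusion counts pooled across folds. The abstract does not say which variant was used.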
