iDPGK: characterization and identification of lysine phosphoglycerylation sites based on sequence-based features

Background Protein phosphoglycerylation, the addition of 1,3-bisphosphoglyceric acid (1,3-BPG) to a lysine residue of a protein to form 3-phosphoglyceryl-lysine, is a reversible, non-enzymatic post-translational modification (PTM) that plays a regulatory role in glucose metabolism and glycolysis. As the number of experimentally verified phosphoglycerylation sites has increased significantly, statistical and machine learning methods are needed to investigate the characteristics of these sites. Currently, research into phosphoglycerylation is very limited, and only a few resources are available for the computational identification of phosphoglycerylation sites.
Results We present a bioinformatics investigation of phosphoglycerylation sites based on sequence-based features. TwoSampleLogo analysis reveals that the regions surrounding phosphoglycerylation sites contain a relatively high proportion of positively charged amino acids, especially in the upstream flanking region. Additionally, according to the PTM-Logo results, non-polar and aliphatic amino acids are more abundant around phosphoglycerylated lysines, which may play a functional role in discriminating between phosphoglycerylation and non-phosphoglycerylation sites. Several types of features were used to build the prediction model on the training dataset: amino acid composition, amino acid pair composition, positional weighted matrix, and position-specific scoring matrix. To improve predictive power, the top features ranked by F-score were combined into the final feature set, and predictive models were trained using decision tree (DT), random forest (RF), and support vector machine (SVM) classifiers. Evaluation by five-fold cross-validation showed that the selected features were the most effective in discriminating between phosphoglycerylated and non-phosphoglycerylated sites.
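As a rough illustration of the feature ranking step, the following minimal sketch computes amino acid composition for sequence windows and ranks the resulting features by F-score. It assumes the F-score is the Fisher-style score commonly used alongside LIBSVM (between-class separation of the means divided by the sum of within-class sample variances); the abstract does not state the exact definition. The function names and the toy windows are hypothetical and are not taken from iDPGK.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(window):
    """Amino acid composition: fraction of each standard residue in a window."""
    residues = [r for r in window if r in AMINO_ACIDS]
    counts = Counter(residues)
    total = len(residues) or 1  # avoid division by zero on empty input
    return [counts[a] / total for a in AMINO_ACIDS]


def f_score(pos, neg):
    """F-score of one feature (assumed Fisher-style definition).

    pos and neg hold the feature values in the positive and negative
    samples; each list needs at least two values.
    """
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    denom = var_pos + var_neg
    if denom == 0:
        return 0.0  # constant feature within both classes: treat as uninformative
    return ((mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2) / denom


def rank_features(pos_vectors, neg_vectors):
    """Return feature indices sorted by decreasing F-score, plus the scores."""
    n_features = len(pos_vectors[0])
    scores = [
        f_score([v[i] for v in pos_vectors], [v[i] for v in neg_vectors])
        for i in range(n_features)
    ]
    order = sorted(range(n_features), key=lambda i: scores[i], reverse=True)
    return order, scores


if __name__ == "__main__":
    # Hypothetical toy windows centred on a lysine (K); not real data.
    positives = ["RKAKKLGVA", "KRLKKAVIG", "RRAKKVLGA"]
    negatives = ["DESKDTSEQ", "EDTKSQDNE", "SDEKETQSD"]
    order, scores = rank_features([aac(w) for w in positives],
                                  [aac(w) for w in negatives])
    for i in order[:5]:
        print(AMINO_ACIDS[i], round(scores[i], 3))
```

The top-ranked indices would then be used to select columns of the feature matrix before training a classifier; how many features to keep is a tuning choice that this sketch does not address.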
Conclusion The SVM model trained with the selected sequence-based features performed well, with a sensitivity of 77.5%, a specificity of 73.6%, an accuracy of 74.9%, and a Matthews correlation coefficient (MCC) of 0.49. The model also performed consistently on the independent testing set, yielding a sensitivity of 75.7% and a specificity of 64.9%. Finally, the model has been implemented as a web-based system, iDPGK, which is freely available at http://mer.hc.mmh.org.tw/iDPGK/ .
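For readers unfamiliar with the four reported measures, the sketch below shows how sensitivity, specificity, accuracy, and the Matthews correlation coefficient follow from the counts of a binary confusion matrix, using the standard definitions. The function name and the counts in the example are hypothetical and do not come from the iDPGK datasets.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification measures from confusion-matrix counts.

    tp/fn: positive sites predicted correctly/incorrectly
    tn/fp: negative sites predicted correctly/incorrectly
    """
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when any marginal total is zero; report 0.0 by convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sensitivity, "Sp": specificity, "Acc": accuracy, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    for name, value in binary_metrics(tp=40, fn=10, tn=30, fp=20).items():
        print(f"{name}: {value:.3f}")
```

Because accuracy is a weighted average of sensitivity and specificity, it depends on the ratio of positive to negative sites, which is one reason MCC is usually reported alongside it for imbalanced data.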
