RF-MaloSite and DL-MaloSite: Methods based on random forest and deep learning to identify malonylation sites

Malonylation, which has recently emerged as an important lysine modification, regulates diverse biological activities and has been implicated in several pervasive disorders, including cardiovascular disease and cancer. However, conventional global proteomics analysis using tandem mass spectrometry can be time-consuming, expensive and technically challenging. Therefore, to complement and extend existing experimental methods for malonylation site identification, we developed two novel computational methods for malonylation site prediction: RF-MaloSite, based on random forest, and DL-MaloSite, based on deep learning. DL-MaloSite requires the primary amino acid sequence as an input, whereas RF-MaloSite utilizes a diverse set of biochemical, physicochemical and sequence-based features. Systematic assessment suggests that both RF-MaloSite and DL-MaloSite perform well in all metrics tested, and particularly well in accuracy, sensitivity and overall method performance (assessed by the Matthews correlation coefficient, MCC). For instance, RF-MaloSite exhibited MCC scores of 0.42 and 0.40 using 10-fold cross-validation and an independent test set, respectively. Meanwhile, DL-MaloSite achieved MCC scores of 0.51 and 0.49 using 10-fold cross-validation and an independent test set, respectively. Importantly, both methods exhibited efficiency scores that were on par with or better than those achieved by existing malonylation site prediction methods. The identification of these sites may also provide important insights into the mechanisms of crosstalk between malonylation and other lysine modifications, such as acetylation, glutarylation and succinylation. To facilitate their use, both methods have been made freely available to the research community at https://github.com/dukkakc/DL-MaloSite-and-RF-MaloSite.
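
For readers unfamiliar with the MCC values reported above, the following is a minimal sketch of how the Matthews correlation coefficient is computed from the counts of a binary confusion matrix. The function name and the example counts are illustrative assumptions and are not taken from the RF-MaloSite or DL-MaloSite code; returning 0.0 when the denominator is zero is a common convention, also assumed here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    tp/tn/fp/fn are true positives, true negatives, false positives and
    false negatives. Returns 0.0 when the denominator is zero (assumed
    convention for degenerate confusion matrices).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for illustration only; not results from the paper.
    print(round(matthews_corrcoef(tp=70, tn=75, fp=25, fn=30), 3))
```

MCC ranges from -1 (every prediction wrong) through 0 (no better than chance) to +1 (perfect prediction). Because it uses all four cells of the confusion matrix, it is often used as a single summary of overall performance for site predictors of this kind.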
