Computational analysis and prediction of lysine malonylation sites by exploiting informative features in an integrative machine-learning framework

As a newly discovered post-translational modification (PTM), lysine malonylation (Kmal) regulates a myriad of cellular processes from prokaryotes to eukaryotes and has important implications in human diseases. Despite its functional significance, computational methods to accurately identify malonylation sites are still lacking and urgently needed. In particular, there is currently no comprehensive analysis and assessment of the different features and machine learning (ML) methods required for constructing the necessary prediction models. Here, we review, analyze and compare 11 different feature encoding methods, with the goal of extracting key patterns and characteristics from residue sequences of Kmal sites. We identify optimized feature sets, with which four commonly used ML methods (random forest, support vector machines, K-nearest neighbor and logistic regression) and one recently proposed method, Light Gradient Boosting Machine (LightGBM), are trained on data from three species, namely, Escherichia coli, Mus musculus and Homo sapiens, and compared using randomized 10-fold cross-validation tests. We show that integrating the single method-based models through ensemble learning further improves prediction performance and model robustness on the independent test. When compared to the existing state-of-the-art predictor, MaloPred, the optimal ensemble models were more accurate for all three species (AUC: 0.930, 0.923 and 0.944 for E. coli, M. musculus and H. sapiens, respectively). Using the ensemble models, we developed an accessible online predictor, kmal-sp, available at http://kmalsp.erc.monash.edu/.
We hope that this comprehensive survey and the proposed strategy for building more accurate models can serve as a useful guide for inspiring future developments of computational methods for PTM site prediction, expedite the discovery of new malonylation and other PTM types and facilitate hypothesis-driven experimental validation of novel malonylated substrates and malonylation sites.
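
To make the workflow summarized above more concrete, the following is a minimal illustrative sketch, not the implementation behind kmal-sp. It extracts sequence windows centered on lysine residues, applies a binary (one-hot) encoding as one example of a sequence-derived feature, averages per-model scores as one simple ensemble rule, and computes the AUC. The flank size, the padding symbol `X`, the unweighted averaging and all function names are assumptions introduced here for illustration only.

```python
from typing import List, Sequence, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "X"  # assumed padding symbol for windows that run past the termini


def lysine_windows(sequence: str, flank: int = 3) -> List[Tuple[int, str]]:
    """Return (1-based position, window) for every lysine in the sequence."""
    padded = PAD * flank + sequence + PAD * flank
    windows = []
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In the padded string the lysine sits at index i + flank,
            # so this slice is centered on it.
            windows.append((i + 1, padded[i : i + 2 * flank + 1]))
    return windows


def one_hot(window: str) -> List[int]:
    """Binary encoding: 20 indicators per residue, all zeros for padding."""
    vector: List[int] = []
    for residue in window:
        vector.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return vector


def ensemble_score(scores: Sequence[float]) -> float:
    """Unweighted mean of per-model probabilities (an assumed ensemble rule)."""
    return sum(scores) / len(scores)


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of positive/negative pairs ranked correctly,
    counting ties as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In practice, the per-model scores passed to `ensemble_score` would come from trained classifiers such as the five ML methods named in the abstract, and the AUC would be computed on held-out data rather than on the training set.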
