Mal-Light: Enhancing Lysine Malonylation Sites Prediction Problem Using Evolutionary-based Features

Post-translational modification (PTM) is an important biological process with a major impact on protein function in both eukaryotic and prokaryotic cells. Over the past decades, a wide range of PTMs has been identified. Among them, malonylation is a recently identified PTM that plays a vital role in a wide range of biological interactions. In particular, it has a potential role in energy metabolism in different species, including Homo sapiens. Identifying PTM sites with experimental methods is time-consuming and costly, so there is a demand for fast and cost-effective computational methods. In this study, we propose a new machine learning method, called Mal-Light, to address this problem. To build this model, we extract local evolutionary-based information according to the interaction of neighboring amino acids using a bi-peptide based method. We then use the Light Gradient Boosting Machine (LightGBM) as our classifier to predict malonylation sites. Our results demonstrate that Mal-Light significantly improves malonylation site prediction performance compared to previous studies in the literature. Using Mal-Light, we achieve a Matthews correlation coefficient (MCC) of 0.74 and 0.60, accuracy of 86.66% and 79.51%, sensitivity of 78.26% and 67.27%, and specificity of 95.05% and 91.75% for Homo sapiens and Mus musculus proteins, respectively. Mal-Light is implemented as an online predictor, publicly available at http://brl.uiu.ac.bd/MalLight/.
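
To make the two technical ingredients of the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the bi-peptide evolutionary features follow the common profile-bigram formulation, in which a position-specific scoring matrix (PSSM) window P of shape (L, 20) is reduced to a 20 x 20 matrix B[m, n] = sum over i of P[i, m] * P[i+1, n]. The function names, the window length of 21, and the random stand-in PSSM are all hypothetical. The resulting 400-dimensional vector is the kind of input that could then be passed to a gradient boosting classifier such as LightGBM (not shown here). The second function computes the four reported measures from confusion-matrix counts.

```python
import numpy as np


def pssm_bigram_features(pssm):
    """Flattened 20x20 bigram matrix B[m, n] = sum_i P[i, m] * P[i+1, n].

    `pssm` is an (L, 20) array for a window around a candidate lysine.
    This profile-bigram form is an assumption, not taken from the paper.
    """
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("expected an (L, 20) PSSM window")
    # Rows 0..L-2 paired with rows 1..L-1: (20, L-1) @ (L-1, 20) -> (20, 20)
    bigram = pssm[:-1].T @ pssm[1:]
    return bigram.ravel()


def binary_metrics(tp, fn, tn, fp):
    """Sensitivity, specificity, accuracy and MCC from confusion counts."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    mcc = (tp * tn - fp * fn) / denom if denom > 0 else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Hypothetical stand-in data: a random 21-residue PSSM window.
rng = np.random.default_rng(0)
window = rng.random((21, 20))
features = pssm_bigram_features(window)
print(features.shape)

# Hypothetical confusion counts, for illustration only.
print(binary_metrics(tp=8, fn=2, tn=9, fp=1))
```

The matrix product is simply a vectorised form of the double sum over adjacent positions, so the feature length stays fixed at 400 regardless of the window length.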
