XG-ac4C: identification of N4-acetylcytidine (ac4C) in mRNA using eXtreme gradient boosting with electron-ion interaction pseudopotentials

N4-acetylcytidine (ac4C) is a post-transcriptional mRNA modification that plays a major role in mRNA stability and in the regulation of translation. The mechanism by which ac4C acts in mRNA is still unclear, and traditional laboratory experiments for locating ac4C sites are time-consuming and expensive. We therefore propose XG-ac4C, a machine learning model based on the eXtreme Gradient Boosting (XGBoost) classifier, for the identification of ac4C sites. XG-ac4C encodes each candidate sequence with a combination of two feature sets: the electron-ion interaction pseudopotential (EIIP) of each nucleotide, and the EIIP of its trinucleotides. Shapley additive explanations (SHAP) and local interpretable model-agnostic explanations (LIME) are applied to assess the importance of the features and their contribution to the final prediction. The results show that XG-ac4C outperforms existing state-of-the-art methods: the proposed model improves the area under the precision-recall curve by 9.4% in cross-validation and by 9.6% in independent tests. Finally, a user-friendly web server based on the proposed model for ac4C site identification is freely available at http://nsclbio.jbnu.ac.kr/tools/xgac4c/.
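As an illustration of the two feature sets named above, the following minimal sketch encodes an RNA sequence with per-nucleotide EIIP values and with frequency-weighted trinucleotide EIIP values. It is not the authors' implementation. The EIIP constants are the commonly cited nucleotide values from the EIIP literature, and several details are assumptions made for this example: U is given the value usually listed for T, a trinucleotide's EIIP is taken as the sum of its three nucleotide values weighted by its normalized frequency, and the function names `eiip_encode`, `pse_eiip_encode`, and `encode` are hypothetical.

```python
from itertools import product

# Commonly cited EIIP values per nucleotide.
# Assumption: U (RNA) is assigned the value usually listed for T (DNA).
EIIP = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}

# All 64 trinucleotides in a fixed order, so every feature has a stable position.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGU", repeat=3)]


def eiip_encode(seq):
    """Map each nucleotide to its EIIP value (unknown symbols become 0.0)."""
    seq = seq.upper().replace("T", "U")
    return [EIIP.get(base, 0.0) for base in seq]


def pse_eiip_encode(seq):
    """Return 64 values: trinucleotide EIIP weighted by normalized frequency."""
    seq = seq.upper().replace("T", "U")
    total = len(seq) - 2  # number of overlapping trinucleotide windows
    if total <= 0:
        return [0.0] * len(TRINUCLEOTIDES)
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(total):
        tri = seq[i:i + 3]
        if tri in counts:  # skip windows containing unknown symbols
            counts[tri] += 1
    # Assumption: a trinucleotide's EIIP is the sum of its nucleotide EIIPs.
    return [
        sum(EIIP[b] for b in tri) * counts[tri] / total
        for tri in TRINUCLEOTIDES
    ]


def encode(seq):
    """Concatenate both encodings into one feature vector."""
    return eiip_encode(seq) + pse_eiip_encode(seq)


if __name__ == "__main__":
    features = encode("ACGUACGUAC")
    print(len(features))  # sequence length plus 64 trinucleotide features
```

For fixed-length sequence windows, `encode` yields vectors of equal length that could be passed to a gradient-boosted tree classifier; the window length and the classifier settings used by XG-ac4C are not given in this abstract and are not assumed here.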
