Intelligent and robust computational prediction model for DNA N4-methylcytosine sites via natural language processing

Abstract DNA N4-methylcytosine (4mC) is an essential epigenetic modification that plays crucial roles in restriction-modification systems. In prokaryotes, 4mC is involved in essential cellular processes, including correcting DNA replication and controlling DNA replication errors. Predicting 4mC modification sites is therefore indispensable for understanding the biological function of this modification. Computational predictors have been proposed, but they have not achieved the desired performance, so an intelligent computational system that accurately identifies 4mC modification sites is needed. This work introduces "iDNA-4mC-DL", an efficient, high-throughput discriminative system that combines the natural language processing method "word2vec" with a convolutional neural network. The results show that iDNA-4mC-DL outperforms current tools on six publicly available benchmark and independent datasets. The proposed model is therefore expected to be a useful, practical tool for basic research and academia.
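To illustrate how a DNA sequence can be treated as text before a word2vec-style embedding, the sketch below splits sequences into overlapping k-mer "words" and maps them to integer ids. The abstract does not state the tokenization scheme, so the k-mer approach, the choice k = 3, and the helper names `kmer_tokenize` and `build_vocabulary` are assumptions made for illustration only, not the authors' pipeline.

```python
from collections import Counter


def kmer_tokenize(sequence, k=3):
    """Split a DNA sequence into overlapping k-mers ("words").

    k=3 is an illustrative assumption, not a value taken from the paper.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocabulary(corpus):
    """Map each distinct k-mer to an integer id.

    Ids are assigned by descending frequency, with ties broken
    alphabetically so the mapping is deterministic.
    """
    counts = Counter(word for sentence in corpus for word in sentence)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: index for index, word in enumerate(ordered)}


# Toy sequences (hypothetical, not from any benchmark dataset).
sequences = ["ACGTAC", "CGTACG"]

# Each sequence becomes a "sentence" of k-mer "words".
corpus = [kmer_tokenize(s) for s in sequences]

# Integer ids of this kind are what an embedding layer would consume.
vocab = build_vocabulary(corpus)
encoded = [[vocab[w] for w in sentence] for sentence in corpus]
```

In a full pipeline, a word2vec model would be trained on `corpus` to give each k-mer a dense vector, and the resulting sequence of vectors would be passed to the convolutional network; both steps are omitted here.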
