A deep learning-based computational approach for discrimination of DNA N6-methyladenosine sites by fusing heterogeneous features

Abstract N6-methyladenine is a post-replication modification that occurs across a wide range of DNA sequences and is involved in many bioprocesses in prokaryotes, such as DNA repair, replication, cellular defense, and transcription. Recently, various computational models have been established to predict N6-methyladenine sites within DNA. However, one of the main issues in the precise prediction of N6-methyladenine is the extraction of features that clearly define the characteristics of N6-methyladenine sites. In the proposed method, input DNA sequences are expressed in a one-hot representation so that they can be fed to successive convolution layers. To expose the hidden information in these sequences, a convolutional neural network (CNN) model is applied to automatically learn abstract features. Then, we apply the tri-nucleotide composition (TNC) feature extraction technique and concatenate its output with the CNN features. Our proposed model achieved 98.05% accuracy on the S1 benchmark dataset and 89.22% accuracy on the S2 benchmark dataset. These results show that the developed approach outperforms existing approaches on all evaluation measures. It is expected that the developed intelligent approach may play a leading and progressive role in academic as well as industrial research in the area of genomics prediction. The code is attached here.
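The two hand-specified representations named in the abstract, one-hot encoding and tri-nucleotide composition, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names (`one_hot`, `tnc`, `fuse`), the A/C/G/T column order, the handling of ambiguous bases, and the use of a plain list to stand in for the CNN-learned features are all assumptions made for illustration.

```python
from itertools import product

NUCLEOTIDES = "ACGT"  # assumed column order for the one-hot encoding
# All 64 possible trinucleotides in a fixed lexicographic order.
TRINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=3)]


def one_hot(seq):
    """Encode a DNA sequence as rows of four 0/1 values (A, C, G, T).

    Bases outside A/C/G/T (for example N) become an all-zero row.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES] for base in seq.upper()]


def tnc(seq):
    """Return the 64 normalized trinucleotide frequencies of a sequence."""
    seq = seq.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:  # skip windows that contain ambiguous bases
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TRINUCLEOTIDES]


def fuse(learned_features, seq):
    """Concatenate a learned feature vector with the 64-dimensional TNC vector.

    `learned_features` is a hypothetical stand-in for the output of a CNN.
    """
    return list(learned_features) + tnc(seq)
```

In this sketch, a sequence of length L yields an L x 4 one-hot matrix for the convolutional branch, while `tnc` always returns 64 values regardless of L, so the fused vector has the length of the learned features plus 64.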
