i6mA-CNN: a convolution based computational approach towards identification of DNA N6-methyladenine sites in rice genome

DNA N6-methyladenine (6mA), the methylation of adenine at the N6 position, is a post-replication modification responsible for many biological functions. Experimental methods for genome-wide 6mA site detection are expensive and labour-intensive. Automated and accurate computational methods can help identify 6mA sites in long genomes, saving significant time and money. Our study develops i6mA-CNN, a convolutional neural network based tool capable of identifying 6mA sites in the rice genome. Our model combines multiple types of features: a PseAAC-inspired customized feature vector, multiple one-hot representations, and dinucleotide physicochemical properties. It achieves an area under the receiver operating characteristic curve of 0.98 with an overall accuracy of 0.94 using 5-fold cross-validation on the benchmark dataset. Finally, we evaluate our model on two other plant genome 6mA site identification datasets besides rice. The results suggest that the proposed tool generalizes its 6mA site identification ability across plant genomes, irrespective of plant species. The web tool for this research is available at https://cutt.ly/Co6KuWG. Supplementary data (benchmark dataset, independent test dataset, comparison purpose dataset, trained model, physicochemical property values, attention mechanism details for motif finding) are available at https://cutt.ly/PpDdeDH.
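The abstract does not spell out the exact encodings, so the following is only a minimal sketch of what "multiple one-hot representations" of a DNA sequence could look like, assuming a mononucleotide and a dinucleotide encoding. The function names, the A/C/G/T column order, and the all-zero treatment of unknown symbols are illustrative assumptions, not the authors' implementation.

```python
from itertools import product

# Assumed column order; the paper's actual ordering is not stated in the abstract.
NUCLEOTIDES = "ACGT"
# All 16 dinucleotides in lexicographic order: AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]


def one_hot(sequence):
    """Encode each nucleotide as a 4-element 0/1 vector.

    Symbols outside A/C/G/T (for example N) become all-zero rows.
    """
    return [[1 if base == n else 0 for n in NUCLEOTIDES]
            for base in sequence.upper()]


def dinucleotide_one_hot(sequence):
    """Encode each overlapping dinucleotide as a 16-element 0/1 vector."""
    seq = sequence.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    return [[1 if pair == d else 0 for d in DINUCLEOTIDES] for pair in pairs]


example = "GATC"  # hypothetical toy sequence, not taken from the benchmark dataset
mono = one_hot(example)             # 4 rows x 4 columns
di = dinucleotide_one_hot(example)  # 3 rows x 16 columns
```

A sequence of length n yields an n x 4 matrix and an (n - 1) x 16 matrix; matrices of this kind are the sort of input a 1D convolutional layer can consume.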
