A Convolutional Neural Network Using Dinucleotide One-hot Encoder for Identifying DNA N6-Methyladenine Sites in the Rice Genome

Abstract N6-methyladenine (m6A) is one of the crucial epigenetic modifications and is related to the control of various DNA processes. Carrying out a genome-wide m6A analysis via wet-lab experiments is fundamental but time-consuming. As complementary methods, computational tools, especially those based on machine learning, are urgently needed. A new protocol, iRicem6A-CNN, was developed for identifying m6A sites in the rice genome. This protocol uses dinucleotide one-hot encoding to generate input tensors for prediction by convolutional neural networks. It achieved five-fold cross-validation and independent testing accuracy values of 93.82% and 96.19%, respectively, outperforming other available predictors. The experimental results demonstrate that iRicem6A-CNN based on 2-mer one-hot encoding not only displays high performance but also performs more stably and robustly than models using 1-mer one-hot encoding. A webserver is accessible at http://iRicem6A-CNN.aibiochem.net
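As an illustration of the dinucleotide (2-mer) one-hot encoding named in the abstract, the following minimal Python sketch maps each overlapping pair of adjacent bases to a 16-dimensional indicator vector. It is not the authors' implementation: the function name `dinucleotide_one_hot`, the stride-1 overlapping windows, the alphabetical ordering of the 16 dinucleotides, and the all-zero row for ambiguous bases are all assumptions made for this example.

```python
from itertools import product

# All 16 dinucleotides in an assumed fixed order: AA, AC, AG, AT, CA, ...
DINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=2)]
INDEX = {dn: i for i, dn in enumerate(DINUCLEOTIDES)}


def dinucleotide_one_hot(seq):
    """Encode a DNA sequence as a (len(seq) - 1) x 16 one-hot matrix.

    Each row corresponds to one overlapping 2-mer (stride 1, an assumption).
    A 2-mer containing a base outside A/C/G/T yields an all-zero row.
    """
    seq = seq.upper()
    rows = []
    for i in range(len(seq) - 1):
        row = [0] * 16
        idx = INDEX.get(seq[i:i + 2])  # None for ambiguous bases such as N
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


# Hypothetical usage: a 5-base sequence gives 4 rows of 16 columns.
encoded = dinucleotide_one_hot("ACGTA")
```

Under these assumptions, a sequence of length n yields an (n - 1) x 16 matrix, compared with n x 4 for ordinary 1-mer one-hot encoding, so each row also carries information about the neighbouring base.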
