A Bioinformatics Tool for the Prediction of DNA N6-Methyladenine Modifications Based on Feature Fusion and Optimization Protocol

DNA N6-methyladenine (6mA) is closely involved with various biological processes. Identifying the distributions of 6mA modifications in genome-scale is of great significance to in-depth understand the functions. In recent years, various experimental and computational methods have been proposed for this purpose. Unfortunately, existing methods cannot provide accurate and fast 6mA prediction. In this study, we present 6mAPred-FO, a bioinformatics tool that enables researchers to make predictions based on sequences only. To sufficiently capture the characteristics of 6mA sites, we integrate the sequence-order information with nucleotide positional specificity information for feature encoding, and further improve the feature representation capacity by analysis of variance-based feature optimization protocol. The experimental results show that using this feature protocol, we can significantly improve the predictive performance. Via further feature analysis, we found that the sequence-order information and positional specificity information are complementary to each other, contributing to the performance improvement. On the other hand, the improvement is also due to the use of the feature optimization protocol, which is capable of effectively capturing the most informative features from the original feature space. Moreover, benchmarking comparison results demonstrate that our 6mAPred-FO outperforms several existing predictors. Finally, we establish a web-server that implements the proposed method for convenience of researchers' use, which is currently available at http://server.malab.cn/6mAPred-FO.

[1]  Bin Liu,et al.  DeepSVM-fold: protein fold recognition by combining support vector machines and pairwise sequence similarity scores generated by deep learning networks , 2019, Briefings Bioinform..

[2]  Wen Zhang,et al.  Predicting human splicing branchpoints by combining sequence-derived features and multi-label learning methods , 2017, BMC Bioinformatics.

[3]  Liujuan Cao,et al.  A novel features ranking metric with application to scalable visual and bioinformatics data classification , 2016, Neurocomputing.

[4]  Jiangning Song,et al.  ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides , 2018, Bioinform..

[5]  Gaotao Shi,et al.  CPPred-RF: A Sequence-based Predictor for Identifying Cell-Penetrating Peptides and Their Uptake Efficiency. , 2017, Journal of proteome research.

[6]  Ran Su,et al.  M6APred-EL: A Sequence-Based Predictor for Identifying N6-methyladenosine Sites Using Ensemble Learning , 2018, Molecular therapy. Nucleic acids.

[7]  Zhengwei Zhu,et al.  CD-HIT: accelerated for clustering the next-generation sequencing data , 2012, Bioinform..

[8]  Zhi Xie,et al.  MethSMRT: an integrative database for DNA N6-methyladenine and N4-methylcytosine generated by single-molecular real-time sequencing , 2016, Nucleic Acids Res..

[9]  Feng Liu,et al.  PredLnc-GFStack: A Global Sequence Feature Based on a Stacked Ensemble Learning Method for Predicting lncRNAs from Transcripts , 2019, Genes.

[10]  S. Linn,et al.  Host specificity of DNA produced by Escherichia coli. XI. In vitro modification of phage fd replicative form. , 1968, Proceedings of the National Academy of Sciences of the United States of America.

[11]  Han Zhang,et al.  BioSeq-Analysis2.0: an updated platform for analyzing DNA, RNA and protein sequences at sequence level and residue level based on machine learning approaches , 2019, Nucleic acids research.

[12]  S. Linn,et al.  Host specificity of DNA produced by Escherichia coli, X. In vitro restriction of phage fd replicative form. , 1962, Proceedings of the National Academy of Sciences of the United States of America.

[13]  Linqiang Pan,et al.  Features Identification for Phenotypic Classification Based on Genes and Gene Pairs , 2017, Current Bioinformatics.

[14]  E. Greer,et al.  N6-Methyladenine: A Conserved and Dynamic DNA Mark. , 2016, Advances in experimental medicine and biology.

[15]  Gwang Lee,et al.  PVP-SVM: Sequence-Based Prediction of Phage Virion Proteins Using a Support Vector Machine , 2018, Front. Microbiol..

[16]  Kai Li,et al.  iPromoter-2L2.0: Identifying Promoters and Their Types by Combining Smoothing Cutting Window Algorithm and Sequence-Based Features , 2019, Molecular therapy. Nucleic acids.

[17]  M. Meselson,et al.  DNA Restriction Enzyme from E. coli , 1968, Nature.

[18]  Q. Zou,et al.  Cancer Diagnosis Through IsomiR Expression with Machine Learning Method , 2016 .

[19]  W. Arber,et al.  Host specificity of DNA produced by Escherichia coli. II. Control over acceptance of DNA from infecting phage lambda. , 1962, Journal of molecular biology.

[20]  Weihua Dong,et al.  Real-Time Forecasting of Hand-Foot-and-Mouth Disease Outbreaks using the Integrating Compartment Model and Assimilation Filtering , 2019, Scientific Reports.

[21]  T. Katagiri,et al.  Cancer Diagnosis , 1992, Springer Berlin Heidelberg.

[22]  Tyson A. Clark,et al.  Direct detection of DNA methylation during single-molecule, real-time sequencing , 2010, Nature Methods.

[23]  Fariza Tahi,et al.  Bi-objective integer programming for RNA secondary structure prediction with pseudoknots , 2018, BMC Bioinformatics.

[24]  Yan Lin,et al.  iTerm-PseKNC: a sequence-based tool for predicting bacterial transcriptional terminators , 2018, Bioinform..

[25]  Kristina M Smith,et al.  Genome-wide high throughput analysis of DNA methylation in eukaryotes. , 2009, Methods.

[26]  Fei Guo,et al.  Improved prediction of protein-protein interactions using novel negative samples, features, and an ensemble classifier , 2017, Artif. Intell. Medicine.

[27]  S. Luria,et al.  A NONHEREDITARY, HOST-INDUCED VARIATION OF BACTERIAL VIRUSES , 1952, Journal of bacteriology.

[28]  K. Chou,et al.  PseKNC: a flexible web server for generating pseudo K-tuple nucleotide composition. , 2014, Analytical biochemistry.

[29]  A. Krais,et al.  Genomic N6‐methyladenine determination by MEKC with LIF , 2010, Electrophoresis.

[30]  Dong Wang,et al.  iLoc‐lncRNA: predict the subcellular location of lncRNAs by incorporating octamer composition into general PseKNC , 2018, Bioinform..

[31]  Ping Wang,et al.  Discriminating ramos and jurkat cells with image textures from diffraction imaging flow cytometry based on a support vector machine , 2016 .

[32]  Yucong Duan,et al.  70ProPred: a predictor for discovering sigma70 promoters based on combining multiple features , 2018, BMC Syst. Biol..

[33]  Hui Ding,et al.  A Random Forest Sub-Golgi Protein Classifier Optimized via Dipeptide and Amino Acid Composition Features , 2019, Front. Bioeng. Biotechnol..

[34]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[35]  Bianbian Li,et al.  Epigenetic DNA Modification N6-Methyladenine Inhibits DNA Replication by DNA Polymerase of Pseudomonas aeruginosa Phage PaP1. , 2019, Chemical research in toxicology.

[36]  Yufeng Wu,et al.  Phytophthora methylomes are modulated by 6mA methyltransferases and associated with adaptive genome regions , 2018, Genome Biology.

[37]  Leyi Wei,et al.  mAHTPred: a sequence-based meta-predictor for improving the prediction of anti-hypertensive peptides using effective feature representation , 2018, Bioinform..

[38]  Quan Zou,et al.  Incorporating Distance-based Top-n-gram and Random Forest to Identify Electron Transport Proteins. , 2019, Journal of proteome research.

[39]  M. Meselson,et al.  Effects of high levels of DNA adenine methylation on methyl-directed mismatch repair in Escherichia coli. , 1983, Genetics.

[40]  Wen Zhang,et al.  Sequence-based bacterial small RNAs prediction using ensemble learning strategies , 2018, BMC Bioinformatics.

[41]  Zhirong Sun,et al.  AthMethPre: a web server for the prediction and query of mRNA m6A sites in Arabidopsis thaliana. , 2016, Molecular bioSystems.

[42]  Jijun Tang,et al.  PhosPred-RF: A Novel Sequence-Based Predictor for Phosphorylation Sites Using Sequential Information Only , 2017, IEEE Transactions on NanoBioscience.

[43]  Fuhui Long,et al.  Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy , 2003, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[44]  Wei Chen,et al.  Classifying Included and Excluded Exons in Exon Skipping Event Using Histone Modifications , 2018, Front. Genet..

[45]  Yuchong Gong,et al.  A network embedding-based multiple information integration method for the MiRNA-disease association prediction , 2019, BMC Bioinformatics.

[46]  Feng Huang,et al.  SFPEL-LPI: Sequence-based feature projection ensemble learning for predicting LncRNA-protein interactions , 2018, PLoS Comput. Biol..

[47]  Yanlin Chen,et al.  SFLLN: A sparse feature learning ensemble method with linear neighborhood regularization for predicting drug-drug interactions , 2019, Inf. Sci..

[48]  Hong Wang,et al.  The decreased N6-methyladenine DNA modification in cancer cells. , 2016, Biochemical and biophysical research communications.

[49]  L. Aravind,et al.  DNA Methylation on N6-Adenine in C. elegans , 2015, Cell.

[50]  M. Marinus,et al.  Analysis of Global Gene Expression and Double-Strand-Break Formation in DNA Adenine Methyltransferase- and Mismatch Repair-Deficient Escherichia coli , 2005, Journal of bacteriology.

[51]  Bin Liu,et al.  ProtDec-LTR3.0: Protein Remote Homology Detection by Incorporating Profile-Based Features Into Learning to Rank , 2019, IEEE Access.

[52]  N. Kleckner,et al.  E. coli oriC and the dnaA gene promoter are sequestered from dam methyltransferase following the passage of the chromosomal replication fork , 1990, Cell.

[53]  Q. Zou,et al.  Gene2vec: gene subsequence embedding for prediction of mammalian N6-methyladenosine sites from mRNA , 2018, RNA.

[54]  Jooyoung Lee,et al.  SVMQA: support‐vector‐machine‐based protein single‐model quality assessment , 2017, Bioinform..

[55]  Shuigeng Zhou,et al.  Predicting Enhancers from Multiple Cell Lines and Tissues across Different Developmental Stages Based On SVM Method , 2018, Current Bioinformatics.

[56]  Wei Chen,et al.  i6mA-Pred: identifying DNA N6-methyladenine sites in the rice genome , 2019, Bioinform..

[57]  Wei Chen,et al.  iDNA4mC: identifying DNA N4‐methylcytosine sites based on nucleotide chemical properties , 2017, Bioinform..

[58]  Yu Zhao,et al.  Identification and analysis of adenine N6-methylation sites in the rice genome , 2018, Nature Plants.

[59]  Bin Liu,et al.  BioSeq-Analysis: a platform for DNA, RNA and protein sequence analysis based on machine learning approaches , 2019, Briefings Bioinform..

[60]  Wei Chen,et al.  PseKNC-General: a cross-platform package for generating various modes of pseudo nucleotide compositions , 2015, Bioinform..

[61]  B. Liu,et al.  iRO-PsekGCC: Identify DNA Replication Origins Based on Pseudo k-Tuple GC Composition , 2019, Front. Genet..

[62]  Yan He,et al.  Classification of Small GTPases with Hybrid Protein Features and Advanced Machine Learning Techniques , 2017, Current Bioinformatics.

[63]  Leyi Wei,et al.  Meta-4mCpred: A Sequence-Based Meta-Predictor for Accurate DNA 4mC Site Prediction Using Effective Feature Representation , 2019, Molecular therapy. Nucleic acids.

[64]  K. Chou,et al.  iRNA-Methyl: Identifying N(6)-methyladenosine sites using pseudo nucleotide composition. , 2015, Analytical biochemistry.

[65]  Feng Huang,et al.  A Fast Linear Neighborhood Similarity-Based Network Link Inference Method to Predict MicroRNA-Disease Associations , 2019, IEEE/ACM Transactions on Computational Biology and Bioinformatics.

[66]  Leyi Wei,et al.  A novel hierarchical selective ensemble classifier with bioinformatics application , 2017, Artif. Intell. Medicine.

[67]  Kil To Chong,et al.  iDNA6mA (5-step rule): Identification of DNA N6-methyladenine sites in the rice genome by intelligent computational model via Chou's 5-step rule , 2019, Chemometrics and Intelligent Laboratory Systems.

[68]  Jing Zhao,et al.  Using Machine Learning to Measure Relatedness Between Genes: A Multi-Features Model , 2019, Scientific Reports.

[69]  Sen Liang,et al.  A Review of Matched-pairs Feature Selection Methods for Gene Expression Data Analysis , 2018, Computational and structural biotechnology journal.

[70]  J. Hanley,et al.  The meaning and use of the area under a receiver operating characteristic (ROC) curve. , 1982, Radiology.