M6AMRFS: Robust Prediction of N6-Methyladenosine Sites With Sequence-Based Features in Multiple Species

As one of the most well-studied RNA methylation modifications, N6-methyladenosine (m6A) plays important roles in various biological processes, such as RNA splicing and degradation. Identifying m6A sites is fundamental to understanding their functional mechanisms. Recently, machine learning based prediction methods have emerged as an effective approach for fast and accurate identification of m6A sites. In this paper, we propose “M6AMRFS”, a new machine learning based predictor for the identification of m6A sites. In this predictor, we use a new feature representation algorithm that encodes RNA sequences with two feature descriptors (dinucleotide binary encoding and local position-specific dinucleotide frequency), and apply the F-score algorithm combined with SFS (Sequential Forward Search) to enhance the feature representation ability. To predict m6A sites, we employ the eXtreme Gradient Boosting (XGBoost) algorithm to build a predictive model. Benchmarking results show that the proposed predictor is competitive with state-of-the-art predictors. Importantly, its robust predictions across multiple species demonstrate that our predictive models have strong generalization ability. To the best of our knowledge, M6AMRFS is the first tool that can be used for the identification of m6A sites in multiple species. To facilitate the use of our predictor, we have established a user-friendly webserver implementing M6AMRFS, which is currently available at http://server.malab.cn/M6AMRFS/. We anticipate that it will be a useful tool for research on m6A sites.
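The two feature descriptors and the F-score ranking named above can be illustrated with a minimal sketch. It is not the authors' implementation: the 4-bit code assigned to each dinucleotide, the prefix-length normalisation of the local frequency, and all function names are assumptions made for illustration, and the F-score follows the commonly used two-class definition. In a full pipeline, the ranked features would then be added one at a time (sequential forward search) and each subset evaluated with a classifier such as XGBoost; that step is omitted here.

```python
from itertools import product
from statistics import mean, variance

NUCLEOTIDES = "ACGU"
# Assumption: each of the 16 dinucleotides is mapped to its index written as 4 bits.
DINUCLEOTIDES = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
BINARY_CODE = {d: [int(b) for b in format(i, "04b")]
               for i, d in enumerate(DINUCLEOTIDES)}


def dinucleotide_binary(seq):
    """Concatenate the 4-bit code of every overlapping dinucleotide in seq."""
    bits = []
    for i in range(len(seq) - 1):
        bits.extend(BINARY_CODE[seq[i:i + 2]])
    return bits


def local_dinucleotide_frequency(seq):
    """For each position i >= 1, count how often the dinucleotide ending at i
    occurs in the prefix seq[:i+1], divided by the prefix length (assumed)."""
    feats = []
    for i in range(1, len(seq)):
        target = seq[i - 1:i + 1]
        prefix = seq[:i + 1]
        count = sum(prefix[j:j + 2] == target for j in range(i))
        feats.append(count / len(prefix))
    return feats


def f_scores(X, y):
    """Two-class F-score per feature column: between-class separation of the
    means divided by the sum of within-class sample variances."""
    scores = []
    for col in zip(*X):
        pos = [v for v, label in zip(col, y) if label == 1]
        neg = [v for v, label in zip(col, y) if label == 0]
        num = (mean(pos) - mean(col)) ** 2 + (mean(neg) - mean(col)) ** 2
        den = variance(pos) + variance(neg)
        if den > 0:
            scores.append(num / den)
        else:
            # Zero within-class variance: perfectly separating or constant feature.
            scores.append(float("inf") if num > 0 else 0.0)
    return scores


def rank_features(X, y):
    """Return feature indices sorted from highest to lowest F-score."""
    scores = f_scores(X, y)
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
```

A feature vector for one sequence would be `dinucleotide_binary(seq) + local_dinucleotide_frequency(seq)`, and `rank_features` gives the order in which a forward search would try adding features.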
