Evaluation of different computational methods on 5-methylcytosine sites identification

5-Methylcytosine (m5C) plays an extremely important role in basic biochemical processes. Despite the rapid increase in the number of m5C sites identified in a wide variety of organisms, their epigenetic roles remain largely unknown. Hence, accurate identification of m5C sites is a key step toward understanding their biological functions. Over the past several years, increasing attention has been paid to the identification of m5C sites in multiple species. In this work, we first summarized current progress in the computational prediction of m5C sites and then constructed a more powerful and reliable model for identifying them. To train the model, we collected experimentally confirmed m5C data from Homo sapiens, Mus musculus, Saccharomyces cerevisiae and Arabidopsis thaliana, and compared the performance of different feature extraction methods and classification algorithms to optimize the prediction model. Based on the optimal model, a novel predictor called iRNA-m5C was developed for the recognition of m5C sites. Finally, we critically evaluated the performance of iRNA-m5C and compared it with existing methods. The results showed that iRNA-m5C produced the best prediction performance among the methods compared. We hope that this paper will serve as a guide to the computational identification of m5C sites, and we anticipate that iRNA-m5C will become a powerful tool for large-scale identification of m5C sites.
