iRNA5hmC: The First Predictor to Identify RNA 5-Hydroxymethylcytosine Modifications Using Machine Learning

RNA 5-hydroxymethylcytosine (5hmC) modification plays an important role in a range of biological processes. Characterizing its distribution across the transcriptome is fundamental to revealing the biological functions of 5hmC. Sequencing-based technologies allow high-throughput identification of 5hmC; however, they are labor-intensive, time-consuming, and expensive. Thus, there is an urgent need for effective and efficient computational methods that can at least complement the high-throughput technologies. In this study, we developed iRNA5hmC, a computational protocol that identifies RNA 5hmC sites using machine learning. In this predictor, we introduced a sequence-based feature algorithm consisting of two feature representations, (1) k-mer spectrum and (2) positional nucleotide binary vector, to capture the sequence characteristics of 5hmC sites. We then applied a two-stage feature space optimization strategy to improve the feature representation ability, and trained a predictive model using a support vector machine (SVM). Our feature analysis showed that feature optimization helps to capture the most discriminative features. Compared with well-known existing feature descriptors, our proposed representations more accurately separate true 5hmC sites from non-5hmC sites. To the best of our knowledge, iRNA5hmC is the first RNA 5hmC predictor that makes predictions from RNA primary sequences alone, without requiring prior experimental knowledge. Importantly, we have established an easy-to-use webserver, currently available at http://server.malab.cn/iRNA5hmC. We expect it to be a useful tool for predicting 5hmC sites.
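
To illustrate the two feature representations named in the abstract, the following is a minimal sketch of how a k-mer spectrum and a positional nucleotide binary vector might be computed and concatenated for one RNA sequence. It is not the published iRNA5hmC implementation: the function names, the choice of k, the frequency normalization, and the A/C/G/U column order are all assumptions made for illustration, and the two-stage feature optimization and SVM training are not shown.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed nucleotide order for both encodings


def kmer_spectrum(seq, k=2):
    """Return the frequency of every possible k-mer, in lexicographic order.

    Normalizing counts by the number of windows is an assumption; raw counts
    would be an equally plausible variant.
    """
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    windows = len(seq) - k + 1
    for i in range(windows):
        sub = seq[i:i + k]
        if sub in counts:  # windows containing ambiguous bases (e.g. N) are skipped
            counts[sub] += 1
    total = max(windows, 1)  # avoid division by zero for very short sequences
    return [counts[m] / total for m in kmers]


def positional_binary(seq):
    """One-hot encode each position as 4 bits (A, C, G, U); unknown bases map to all zeros."""
    vec = []
    for base in seq:
        vec.extend(1 if base == b else 0 for b in ALPHABET)
    return vec


def encode(seq, k=2):
    """Concatenate both representations into a single feature vector (hypothetical helper)."""
    seq = seq.upper().replace("T", "U")  # accept DNA-style input as well
    return kmer_spectrum(seq, k) + positional_binary(seq)


if __name__ == "__main__":
    features = encode("ACGUACGUC", k=2)
    print(len(features))  # 4**2 k-mer entries plus 4 entries per position
```

A vector produced this way has 4^k entries from the k-mer spectrum plus four entries per sequence position, so fixed-length sequence windows are needed for the vectors to be comparable across samples. Such vectors could then be passed to any standard SVM implementation.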
