Transcriptome-wide single molecule mapping of 2’-O-Methylation (Nm) sites in Nanopore direct RNA sequencing datasets using the Nm-nano framework

2′-O-methylation (Nm) is one of the most abundant modifications of mRNAs and non-coding RNAs. It occurs when a methyl group (–CH3) is added to the 2′ hydroxyl (–OH) of the ribose moiety. Because every ribose sugar carries this hydroxyl group, the modification can occur on any nucleotide regardless of its nitrogenous base. Nm contributes to many biological processes, such as the normal functioning of tRNA, the protection of mRNA against degradation by DXO, and the biogenesis and specificity of rRNA. Recently, the single-molecule long-read RNA sequencing offered by Oxford Nanopore Technologies has enabled the direct detection of RNA modifications on the molecule being sequenced, but to our knowledge only one previous study has applied this technology to predict the stoichiometry of Nm-modified sites, and it did so in RNA sequences of yeast cells. In this paper, we extend this research direction by proposing a bio-computational framework, Nm-Nano, for predicting Nm sites in Nanopore direct RNA sequencing reads of human cell lines, which are larger and more complex than yeast. Nm-Nano integrates two supervised machine learning (ML) models for predicting Nm sites in Nanopore sequencing data: Extreme Gradient Boosting (XGBoost) and Random Forest (RF) with k-mer embedding. The XGBoost model is trained on features extracted from the modified and unmodified Nanopore signals and their corresponding k-mers, taken from the underlying RNA sequence reported by base-calling. The RF model is trained on the same feature set plus a dense vector representation of the RNA k-mers generated by the word2vec technique. Results on two benchmark datasets generated from Nanopore RNA sequencing data of the HeLa and HEK293 human cell lines show that Nm-Nano performs well.
In independent validation testing, Nm-Nano identified Nm sites with accuracies of 93% and 88% using the XGBoost and RF with k-mer embedding models, respectively, when each model was trained on the HeLa benchmark dataset and tested on the HEK293 benchmark dataset. Deploying Nm-Nano to predict Nm sites in the HeLa cell line identified 196 genes as the most frequently Nm-modified among all genes carrying Nm sites in this cell line. Functional and gene set enrichment analysis of these genes shows significant enrichment of a wide range of functional processes in the HeLa cell line, with high-confidence (adjusted p-value < 0.05) enriched ontologies that are representative of the role of Nm modification in immune response and cellular homeostasis. Similarly, deploying Nm-Nano to predict Nm sites in the HEK293 cell line identified 176 genes as the most frequently Nm-modified genes in this cell line. Functional and gene set enrichment analysis of these genes shows significant enrichment of a wide range of functional processes in the HEK293 cell line, such as “MHC class I protein complex”, “mitotic spindle assembly”, “response to glucocorticoid”, and “nucleocytoplasmic transport”. The source code of Nm-Nano is freely available at https://github.com/Janga-Lab/Nm-Nano.
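
As a rough illustration of the kind of feature row described above (signal-derived features combined with a k-mer representation) and of the accuracy measure used in cross-cell-line validation, a minimal sketch follows. Everything in it is an assumption made for illustration: the function names, the choice of signal summary statistics, the 5-mer length, and the ACGU alphabet are hypothetical, and a plain one-hot encoding stands in for the word2vec embedding so the example needs only the Python standard library. It is not the Nm-Nano implementation.

```python
from statistics import mean, pstdev

# Assumed RNA alphabet; the alphabet actually used by a pipeline may differ.
BASES = "ACGU"


def one_hot_kmer(kmer):
    """Encode a k-mer as a flat one-hot vector (4 values per base).

    Stand-in for a learned dense embedding such as word2vec.
    """
    vec = []
    for base in kmer:
        vec.extend(1.0 if base == b else 0.0 for b in BASES)
    return vec


def event_features(kmer, signal):
    """Build one feature row from a k-mer and its raw current samples.

    Hypothetical features: signal mean, population standard deviation,
    and number of samples, followed by the k-mer encoding.
    """
    return [mean(signal), pstdev(signal), float(len(signal))] + one_hot_kmer(kmer)


def accuracy(y_true, y_pred):
    """Fraction of sites whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical 5-mer with three current samples (made-up values).
row = event_features("GGACU", [80.0, 82.0, 84.0])
print(len(row))  # 3 signal features + 5 bases * 4 one-hot values
```

In an independent validation of the kind reported here, rows like these would be built separately for each cell line, a classifier would be fitted on one set, and `accuracy` would be computed on predictions for the other.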
