BERT2OME: Prediction of 2′-O-Methylation Modifications From RNA Sequence by Transformer Architecture Based on BERT

Recent work on language models has resulted in state-of-the-art performance on various language tasks. Among these models, Bidirectional Encoder Representations from Transformers (BERT) focuses on contextualizing word embeddings to extract the context and semantics of words. Separately, post-transcriptional 2′-O-methylation (Nm) RNA modification is important in various cellular tasks and is related to a number of diseases. Existing high-throughput experimental techniques for detecting these modifications are time-consuming and costly, which slows the exploration of the associated functional processes. Here, to speed up the study of these biological processes, we propose Bert2Ome, an efficient method to infer 2′-O-methylation RNA modification sites from RNA sequences. Bert2Ome combines a BERT-based model with convolutional neural networks (CNNs) to infer the relationship between modification sites and RNA sequence content. Unlike the methods proposed so far, Bert2Ome treats each given RNA sequence as text and focuses on improving modification prediction performance by integrating the pretrained deep learning-based language model BERT. Additionally, our transformer-based approach can infer modification sites across multiple species. Under 5-fold cross-validation, the human and mouse accuracies were 99.15% and 94.35%, respectively. Similarly, the ROC AUC scores were 0.99 and 0.94 for the same species. Detailed results show that Bert2Ome reduces the time consumed in biological experiments and outperforms the existing approaches across different datasets and species over multiple metrics. Additionally, deep learning approaches such as 2D CNNs are more promising for learning from BERT attributes than more conventional machine learning methods.
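As a rough illustration of what treating an RNA sequence as text and evaluating with 5-fold cross-validation can look like, the following is a minimal sketch using only the Python standard library. The overlapping k-mer tokenization, the value k = 3, the T-to-U normalization, and the helper names `kmer_tokens` and `k_fold_indices` are assumptions made for illustration; the abstract does not specify how Bert2Ome tokenizes sequences or builds its folds, and no BERT or CNN model is included here.

```python
from typing import List, Tuple


def kmer_tokens(seq: str, k: int = 3) -> List[str]:
    """Split an RNA sequence into overlapping k-mers that act as 'words'.

    Assumption: overlapping k-mers with k=3; the source does not state
    the tokenization scheme actually used.
    """
    # Assumption: normalize case and map DNA-style T to RNA-style U.
    seq = seq.upper().replace("T", "U")
    if len(seq) < k:
        return []
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def k_fold_indices(n: int, folds: int = 5) -> List[Tuple[List[int], List[int]]]:
    """Return (train, test) index lists for a plain, unshuffled k-fold split."""
    # Distribute the remainder so fold sizes differ by at most one.
    sizes = [n // folds + (1 if i < n % folds else 0) for i in range(folds)]
    splits = []
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        splits.append((train, test))
        start += size
    return splits


# A toy sequence becomes a space-separated "sentence" of k-mer words,
# the kind of input a text-oriented language model could consume.
sentence = " ".join(kmer_tokens("AUGGCUA", k=3))
print(sentence)

# Each of the 5 folds holds out a different fifth of the samples.
for train_idx, test_idx in k_fold_indices(10, folds=5):
    print(len(train_idx), len(test_idx))
```

In practice the folds would normally be shuffled and stratified by label; the unshuffled split above only shows the bookkeeping behind a 5-fold evaluation.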
