RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites

Pseudouridine modification, one of the ubiquitous chemical modifications in RNA, is crucial for various cellular biological and physiological processes. Gaining insight into the functional mechanisms involved requires precise identification of pseudouridine sites in RNA. With the progress of next-generation sequencing technology, several useful machine learning approaches have recently become available; however, existing methods do not predict these sites with high accuracy, so a more accurate predictor is required. In this study, a random forest-based predictor named RF-PseU is proposed for the prediction of pseudouridylation sites. To optimize the feature representation and obtain a better model, the light gradient boosting machine (LightGBM) algorithm and an incremental feature selection strategy were used to select the optimal feature vector for training the random forest model. On the same benchmark data sets of three species, RF-PseU performs better overall than previous state-of-the-art predictors. The integrated average leave-one-out cross-validation and independent testing accuracies were 71.4% and 74.7%, respectively, representing increments of 3.63% and 4.77% over the best existing predictor. Moreover, the final RF-PseU prediction model was built on leave-one-out cross-validation and provides a reliable and robust tool for identifying pseudouridine sites. A web server with a user-friendly interface is accessible at http://148.70.81.170:10228/rfpseu.
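The feature selection idea in the abstract (rank the features, then grow the feature set one feature at a time and keep the subset with the best leave-one-out accuracy) can be sketched in a few lines. This is an illustration under stated assumptions, not the RF-PseU implementation: the feature ranking is passed in as a hypothetical list (in the study it would come from LightGBM importances), a nearest-centroid classifier stands in for the random forest so that the sketch needs only the standard library, and the toy data and all function names are invented for the example.

```python
from statistics import mean


def nearest_centroid_predict(train_X, train_y, x):
    """Predict the label of x as the class with the closest centroid.

    Stand-in classifier (assumption): the paper trains a random forest.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, y in zip(train_X, train_y) if y == label]
        centroid = [mean(col) for col in zip(*rows)]
        dist = sum((a - b) ** 2 for a, b in zip(x, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_accuracy(X, y):
    """Leave-one-out cross-validation: hold out each sample once."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        hits += nearest_centroid_predict(train_X, train_y, X[i]) == y[i]
    return hits / len(X)


def incremental_feature_selection(X, y, ranking):
    """Evaluate the top-1, top-2, ... ranked features and keep the best prefix.

    Ties are resolved in favour of the smaller feature set.
    """
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranking) + 1):
        cols = ranking[:k]
        X_k = [[row[c] for c in cols] for row in X]
        acc = loo_accuracy(X_k, y)
        if acc > best_acc:
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc


# Toy data (invented): feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.0, 5.0, 1.0],
    [0.1, -5.0, 2.0],
    [0.2, 4.0, -3.0],
    [1.0, -4.0, 1.5],
    [1.1, 5.0, -2.0],
    [0.9, -5.0, 3.0],
]
y = [0, 0, 0, 1, 1, 1]
ranking = [0, 2, 1]  # hypothetical importance order, most important first

selected, accuracy = incremental_feature_selection(X, y, ranking)
print(selected, accuracy)
```

On real data the accuracy curve over k is usually not monotonic, which is why every prefix of the ranking is evaluated rather than stopping at the first drop.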
