A Linear Regression Predictor for Identifying N6-Methyladenosine Sites Using Frequent Gapped K-mer Pattern

N6-methyladenosine (m6A) is one of the most common and abundant RNA modifications and is involved in many biological processes in humans. Abnormal RNA modifications are often associated with diseases, including tumors, neurogenic diseases, and embryonic retardation. Identifying m6A sites is therefore of great importance in the post-genomic age. Although many lab-based methods have been proposed to annotate m6A sites, they are time-consuming and costly. Given these drawbacks of experimental approaches, computational methods have been suggested as a complementary way to identify m6A sites. In this study, we develop a novel feature extraction algorithm based on the frequent gapped k-mer pattern (FGKP) and apply linear regression to construct the prediction model. The new predictor is used to identify m6A sites in the Saccharomyces cerevisiae database. In 10-fold cross-validation, it performs better than recent methods. These comparative results indicate that our model has the potential to become a useful and effective tool for genome analysis and to provide further insight into locating m6A sites.
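The abstract does not spell out how FGKP features are built, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes a gapped pattern is two nucleotides separated by up to `max_gap` wildcard positions, that a pattern is "frequent" when it occurs in at least `min_support` of the positive sequences, and that the linear regression is an ordinary least-squares fit thresholded at 0.5. The function names, parameters, and toy sequences are all hypothetical and are not taken from the Saccharomyces cerevisiae data.

```python
from collections import Counter

import numpy as np


def gapped_patterns(seq, max_gap=2):
    """Count patterns X + '.'*g + Y: two nucleotides separated by g wildcards."""
    counts = Counter()
    for g in range(max_gap + 1):
        for i in range(len(seq) - g - 1):
            counts[seq[i] + "." * g + seq[i + g + 1]] += 1
    return counts


def frequent_patterns(seqs, min_support=0.5, max_gap=2):
    """Keep patterns present in at least min_support of the sequences (assumed rule)."""
    support = Counter()
    for s in seqs:
        support.update(set(gapped_patterns(s, max_gap)))
    return sorted(p for p, n in support.items() if n / len(seqs) >= min_support)


def featurize(seqs, patterns, max_gap=2):
    """Relative frequency of each selected pattern, plus a constant intercept column."""
    rows = []
    for s in seqs:
        c = gapped_patterns(s, max_gap)
        total = sum(c.values()) or 1
        rows.append([c[p] / total for p in patterns] + [1.0])
    return np.array(rows)


# Made-up toy sequences, for illustration only.
pos = ["GGACAGGACU", "UGGACUGGAC", "AGGACAGGAC"]
neg = ["UUUUAUUUUU", "CCCCACCCCC", "UUCCAUUCCU"]

patterns = frequent_patterns(pos, min_support=1.0)
X = featurize(pos + neg, patterns)
y = np.array([1.0] * len(pos) + [0.0] * len(neg))

# Ordinary least squares; scores at or above 0.5 are called m6A sites.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = (X @ w >= 0.5).astype(int)
print(patterns)
print(pred.tolist())
```

Selecting patterns from the positive set only is one possible design choice; a real predictor would select patterns inside each cross-validation fold to avoid leaking information from the held-out sequences.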
