DeepAc4C: a convolutional neural network model with hybrid features composed of physicochemical patterns and distributed representation information for identification of N4-acetylcytidine in mRNA

MOTIVATION N4-acetylcytidine (ac4C) is the only acetylation modification that has been characterized in eukaryotic RNA, and it is correlated with various human diseases. Laboratory identification of ac4C is complicated by factors such as sample hydrolysis and high cost. Existing computational methods for identifying ac4C do not achieve satisfactory performance.

RESULTS We developed a novel tool, DeepAc4C, which identifies ac4C with convolutional neural networks (CNNs) applied to hybrid features composed of physicochemical patterns and a distributed representation of nucleic acids. Our results show that the proposed model achieved better and more balanced performance than existing predictors. Furthermore, we evaluated how specific features, and the interactions between them, affected the model's predictions. Several interesting sequence motifs specific to ac4C were identified.

AVAILABILITY AND IMPLEMENTATION The webserver is freely accessible at https://webmalab.cn/. The source code and datasets are available on Zenodo at https://doi.org/10.5281/zenodo.5138047 and on GitHub at https://github.com/wangchao-malab/DeepAc4C.

SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
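To make the idea of "hybrid features" concrete, the following is a minimal sketch of how a physicochemical encoding and a distributed (embedding-based) representation of an RNA sequence might be concatenated into a single feature vector. It is not the DeepAc4C implementation: the abstract does not say which physicochemical properties or which embedding the authors used. The choice of electron-ion interaction pseudopotential (EIIP) values, the 3-mer tokenization, the averaging step, and every function name and the toy lookup table below are assumptions made purely for illustration.

```python
from typing import Dict, List

# Assumption: EIIP is used here as one example of a physicochemical property.
# Values are the published nucleotide EIIP values; U is given the value of T.
EIIP: Dict[str, float] = {"A": 0.1260, "C": 0.1340, "G": 0.0806, "U": 0.1335}


def normalize(seq: str) -> str:
    """Upper-case the sequence and write it in the RNA alphabet (T -> U)."""
    return seq.upper().replace("T", "U")


def kmers(seq: str, k: int = 3) -> List[str]:
    """Split a sequence into overlapping k-mers, the 'words' of an embedding model."""
    seq = normalize(seq)
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def eiip_encode(seq: str) -> List[float]:
    """Per-nucleotide physicochemical encoding (raises KeyError on non-ACGU symbols)."""
    return [EIIP[base] for base in normalize(seq)]


def mean_embedding(tokens: List[str], table: Dict[str, List[float]], dim: int) -> List[float]:
    """Average the embedding vectors of the tokens; unknown k-mers map to zeros."""
    if not tokens:
        return [0.0] * dim
    zero = [0.0] * dim
    vectors = [table.get(token, zero) for token in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def hybrid_features(seq: str, table: Dict[str, List[float]], dim: int, k: int = 3) -> List[float]:
    """Concatenate the physicochemical encoding with the averaged k-mer embedding."""
    return eiip_encode(seq) + mean_embedding(kmers(seq, k), table, dim)


# Hypothetical 2-dimensional lookup table standing in for a trained embedding model.
toy_table = {"ACG": [1.0, 0.0], "CGU": [0.0, 1.0]}
features = hybrid_features("ACGTC", toy_table, dim=2)
print(len(features))  # 5 per-base values plus 2 embedding dimensions
```

In a real pipeline the lookup table would come from an embedding model trained on a corpus of sequences, and the resulting vectors (or per-position matrices rather than an average) would be passed to a CNN classifier. Averaging is used here only to keep the sketch short.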
