Word2vec based deep learning network for DNA N4-methylcytosine sites identification

Abstract: As an important epigenetic modification, N4-methylcytosine (4mC) not only controls DNA replication and the cell cycle, but also participates in regulating cell differentiation and gene expression. However, its biological function remains poorly understood. To further reveal the function and regulatory mechanism of 4mC, it is important to accurately identify 4mC sites and determine their distribution in the genome. In this study, we propose 4mcDeep-W2VC, a general and efficient deep neural network for identifying 4mC sites. Unlike other methods, our predictor automatically extracts features from DNA sequences. Instead of one-hot encoding, we use the word2vec algorithm to learn a distributed representation of k-mers. Unlike the traditional k-mer method, this distributed representation takes the potential relationships between k-mers into account. We then feed the distributed representation of the DNA sequence into a convolutional neural network to extract hidden, high-level, and more biologically meaningful features. The experimental results show that our predictor achieves better performance in identifying 4mC sites than state-of-the-art predictors.
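The preprocessing idea behind a word2vec-style representation can be sketched in a few lines: a DNA sequence is split into overlapping k-mers that play the role of words, and neighbouring k-mers form the (center, context) pairs from which a skip-gram model would learn. This is only an illustrative sketch, not the authors' implementation: the function names, the values of `k`, `stride`, and `window`, and the choice of skip-gram rather than CBOW are all assumptions, since the abstract does not specify them.

```python
from typing import List, Tuple


def kmer_tokens(seq: str, k: int = 3, stride: int = 1) -> List[str]:
    """Split a DNA sequence into overlapping k-mers (the 'words').

    k and stride are illustrative defaults, not values from the paper.
    """
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Build (center, context) pairs of the kind a skip-gram model trains on.

    The window size is a hypothetical choice made for this example.
    """
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a k-mer is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = kmer_tokens("ACGTAC", k=3)
pairs = skipgram_pairs(tokens, window=1)
print(tokens)
print(pairs)
```

In a full pipeline, pairs like these would be used to train an embedding for each k-mer, and the sequence of embeddings would form the matrix passed to the convolutional network.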