A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information

Language representation models have recently drawn considerable attention in natural language processing because of their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieves state-of-the-art performance. BERT uses contextualized word embeddings to capture the semantics of words together with the context in which they appear. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information in DNA sequences. We treated DNA sequences as natural-language sentences and used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. Our BERT-based features improved sensitivity, specificity, accuracy, and Matthews correlation coefficient by more than 5-10% compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning (represented here by 2D convolutional neural networks; CNNs) has the potential to learn BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
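The two ends of the pipeline described in the abstract can be sketched in a few lines of Python: turning a DNA sequence into a "sentence" that a language model could consume, and computing the four reported evaluation metrics from confusion-matrix counts. This is a minimal illustration, not the authors' implementation. The function names `dna_to_sentence` and `binary_metrics`, the overlapping k-mer tokenization, and the default `k=1` are assumptions made here, since the abstract does not specify a tokenization scheme. The metric formulas are the standard definitions.

```python
import math


def dna_to_sentence(seq: str, k: int = 1) -> str:
    """Split a DNA sequence into overlapping k-mer 'words' joined by spaces.

    Assumption for illustration only: the abstract does not say how
    sequences are tokenized, so k and the overlap are hypothetical choices.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    seq = seq.upper()
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Return sensitivity, specificity, accuracy and MCC from confusion counts."""
    total = tp + tn + fp + fn
    # Guard each ratio against an empty denominator.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally reported as 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }
```

For example, `dna_to_sentence("ACGT", k=2)` produces the string `AC CG GT`. A string in this form could then be passed to a BERT tokenizer, and the resulting per-token embeddings stacked into the fixed-length matrix that a 2D CNN takes as input.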

Yu-Yen Ou | Trinh-Trung-Duong Nguyen | Quang-Thai Ho | Nguyen-Quoc-Khanh Le
