A transformer architecture based on BERT and 2D convolutional neural network to identify DNA enhancers from sequence information

Language representation models have recently drawn considerable attention in natural language processing because of their remarkable results. Among them, bidirectional encoder representations from transformers (BERT) has proven to be a simple yet powerful language model that achieves state-of-the-art performance. BERT uses contextualized word embeddings to capture the semantics of each word in the context in which it appears. In this study, we present a novel technique that incorporates a BERT-based multilingual model into bioinformatics to represent the information in DNA sequences. We treated DNA sequences as natural sentences and used BERT models to transform them into fixed-length numerical matrices. As a case study, we applied our method to DNA enhancer prediction, a well-known and challenging problem in this field. Our BERT-based features improved sensitivity, specificity, accuracy, and Matthews correlation coefficient by 5-10% or more compared with the current state-of-the-art features in bioinformatics. Moreover, further experiments show that deep learning, represented here by 2D convolutional neural networks (CNNs), has the potential to learn from BERT features better than traditional machine learning techniques. In conclusion, we suggest that BERT and 2D CNNs could open a new avenue in biological modeling using sequence information.
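The four evaluation measures named above follow from the counts in a binary confusion matrix. The sketch below is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `binary_metrics` is hypothetical, and the counts in the usage lines are made-up numbers, not results from this study.

```python
import math


def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute sensitivity, specificity, accuracy and MCC from confusion counts.

    Hypothetical helper for illustration; tp/tn/fp/fn are the numbers of true
    positives, true negatives, false positives and false negatives.
    """
    total = tp + tn + fp + fn
    # Sensitivity: fraction of actual positives (enhancers) that were recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of actual negatives (non-enhancers) that were recovered.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Accuracy: fraction of all predictions that were correct.
    accuracy = (tp + tn) / total if total else 0.0
    # Matthews correlation coefficient; set to 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
    }


# Usage with made-up counts (not taken from the paper).
scores = binary_metrics(tp=80, tn=70, fp=30, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, the Matthews correlation coefficient uses all four cells of the confusion matrix, which makes it a useful companion measure when the positive and negative classes differ in size.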
