DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome

Deciphering the language of non-coding DNA is one of the fundamental problems in genome research. The gene regulatory code is highly complex owing to polysemy and distant semantic relationships, which previous informatics methods often fail to capture, especially in data-scarce scenarios. To address this challenge, we developed a novel pre-trained bidirectional encoder representation, named DNABERT, that forms a global and transferable understanding of genomic DNA sequences based on upstream and downstream nucleotide contexts. We show that a single pre-trained transformer model can simultaneously achieve state-of-the-art performance on many sequence prediction tasks after straightforward fine-tuning on small task-specific datasets. Further, DNABERT enables direct visualization of nucleotide-level importance and semantic relationships within input sequences, improving interpretability and supporting accurate identification of conserved sequence motifs and functional genetic variants. Finally, we demonstrate that DNABERT pre-trained on the human genome can be readily applied to other organisms with exceptional performance.
