Novel transformer networks for improved sequence labeling in genomics

In genomics, a wide range of machine learning methods has been investigated for annotating biological sequences at positions of interest such as transcription start sites, translation initiation sites, methylation sites, splice sites and promoter start sites. In recent years, this area has been dominated by convolutional neural networks, which typically outperform previously designed methods because they automatically scan for influential sequence motifs. However, those architectures do not allow for efficient processing of the full genomic sequence. As an improvement, we introduce transformer architectures for whole-genome sequence labeling tasks. We show that these architectures, recently introduced for natural language processing, are better suited for processing and annotating long DNA sequences. We apply existing networks and introduce an optimized method for the calculation of attention from input nucleotides. To demonstrate this, we evaluate our architecture on several sequence labeling tasks and find that it achieves state-of-the-art performance when compared with specialized models for the annotation of transcription start sites, translation initiation sites and 4mC methylation in E. coli.
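To make the phrase "calculation of attention from input nucleotides" concrete, the following is a minimal sketch of standard scaled dot-product self-attention (as introduced for natural language processing) applied to a one-hot encoded DNA string. It is not the optimized method referred to above, which is not described in this text. The helper names, the random projection matrices and the dimension `d_model = 8` are assumptions chosen purely for illustration.

```python
import math
import random

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-dimensional one-hot vectors."""
    return [[1.0 if base == n else 0.0 for n in NUCLEOTIDES] for base in seq]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x, w_q, w_k, w_v):
    """Standard scaled dot-product self-attention over the rows of x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # transpose of k
    scores = matmul(q, k_t)                       # (length x length) similarities
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical stand-in for learned projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model = 8  # assumed embedding size, for illustration only
w_q, w_k, w_v = (random_matrix(4, d_model, rng) for _ in range(3))

x = one_hot("ATGCGTAC")
context, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over all positions in the input, so every nucleotide can draw on information from any other position in a single step. Because this sketch adds no positional information, the two `A` nucleotides in the example receive identical output vectors; a practical model would have to encode position in some way.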
