Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences

Enhancer sequences regulate the expression of genes from afar by providing a binding platform for transcription factors, often in a tissue-specific or context-specific manner. Despite their importance in health and disease, our understanding of these DNA sequences and their regulatory grammar is limited. This impairs our ability to identify new enhancers along the genome and to understand the effect of enhancer mutations and their role in genetic diseases. We trained deep convolutional neural networks (CNNs) to identify enhancer sequences in multiple species. We used multiple biological datasets, including simulated sequences, in vivo binding data of single transcription factors, and genome-wide chromatin maps of active enhancers in 17 mammalian species. Our deep networks obtained high classification accuracy by combining two training strategies. First, by training on enhancers vs. non-enhancer background sequences, we identified short (1-4 bp) low-complexity motifs. Second, by replacing the negative training set with adversarial k-order random shuffles of enhancer sequences (thus maintaining base composition while shattering longer motifs, including transcription factor binding sites), we identified a set of biologically meaningful motifs unique to enhancers. In addition, classification performance improved when positive data from all species were combined, suggesting a shared mammalian regulatory architecture. Our results demonstrate that the design of adversarial training data and the transfer of learned parameters between networks trained on different species or datasets improve overall performance and capture biologically meaningful information in the parameters of the learned network.

Contact: or.zuk@mail.huji.ac.il, tommy@cs.huji.ac.il
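
The shuffled negative set described above can be illustrated with a minimal sketch. It is not the authors' procedure: it uses a simple block shuffle (permuting consecutive length-k chunks) as a stand-in for a shuffle that preserves exact k-let counts, and the function name `block_shuffle` and the example sequence are hypothetical. The sketch only shows the property the abstract relies on: base composition is kept while patterns longer than k are broken up.

```python
import random
from collections import Counter

def block_shuffle(seq, k, rng=None):
    """Permute consecutive length-k blocks of seq (hypothetical helper).

    Base composition is preserved exactly; motifs longer than k are
    broken unless they happen to be reassembled by chance.
    """
    rng = rng or random.Random()
    # Split into consecutive blocks; the last block may be shorter than k.
    blocks = [seq[i:i + k] for i in range(0, len(seq), k)]
    rng.shuffle(blocks)
    return "".join(blocks)

# Made-up sequence for illustration only, not real enhancer data.
enhancer = "ACGTTGACGTCAGGCTATAAACGT"
negative = block_shuffle(enhancer, k=2, rng=random.Random(0))

# The shuffled negative has the same length and the same base counts.
same_composition = Counter(negative) == Counter(enhancer)
```

A classifier trained to separate `enhancer` from such `negative` sequences cannot rely on base composition alone, which is the motivation the abstract gives for the shuffled negative set.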
