Deep learning at base-resolution reveals motif syntax of the cis-regulatory code

Genes are regulated through enhancer sequences, in which transcription factor binding motifs and their specific arrangements (syntax) form a cis-regulatory code. To understand the relationship between motif syntax and transcription factor binding, we train a deep learning model that uses DNA sequence to predict base-resolution binding profiles of four pluripotency transcription factors: Oct4, Sox2, Nanog, and Klf4. We interpret the model to accurately map hundreds of thousands of motifs in the genome, learn novel motif representations, and identify rules by which motifs and syntax influence transcription factor binding. We find that instances of strict motif spacing are largely due to retrotransposons, but that soft motif syntax influences motif interactions at protein and nucleosome range. Most strikingly, Nanog binding is driven by motifs with a strong preference for ∼10.5 bp spacings, corresponding to helical periodicity. Interpreting deep learning models applied to high-resolution binding data is a powerful and versatile approach to uncover the motifs and syntax of cis-regulatory sequences.
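
To make the idea of a ∼10.5 bp spacing preference concrete, the following minimal sketch shows one generic way such a periodicity could be checked: build a histogram of distances between motif pairs and look for the strongest Fourier component. This is an illustration under stated assumptions, not the analysis used in the paper; the function name `dominant_period`, its parameters, and the synthetic spacing data are all hypothetical.

```python
import numpy as np

def dominant_period(spacings, max_dist=150, min_period=5.0, max_period=20.0):
    """Return the period (in bp) with the highest Fourier power in a
    histogram of motif-pair spacings. Hypothetical helper for illustration."""
    # Histogram of spacings from 0 to max_dist bp (larger values are dropped).
    counts = np.bincount(np.asarray(spacings, dtype=int), minlength=max_dist + 1)
    counts = counts[: max_dist + 1].astype(float)
    counts -= counts.mean()  # remove the constant (zero-frequency) component

    power = np.abs(np.fft.rfft(counts)) ** 2
    freqs = np.fft.rfftfreq(counts.size, d=1.0)  # cycles per bp

    # Convert frequencies to periods, leaving the zero frequency at infinity.
    periods = np.full_like(freqs, np.inf)
    periods[1:] = 1.0 / freqs[1:]

    # Only consider periods in a plausible window around the helical repeat.
    mask = (periods >= min_period) & (periods <= max_period)
    return float(periods[mask][np.argmax(power[mask])])

# Synthetic, deterministic example data (an assumption, not real measurements):
# spacing d occurs with a frequency modulated by a 10.5 bp cosine.
distances = np.arange(0, 151)
weights = np.rint(100 * (1 + np.cos(2 * np.pi * distances / 10.5))).astype(int)
synthetic_spacings = np.repeat(distances, weights)

period = dominant_period(synthetic_spacings)
print(f"Strongest period: {period:.2f} bp")
```

Because the histogram spans only 151 positions, the recovered period is limited to the discrete Fourier frequencies and lands near, rather than exactly on, 10.5 bp. A longer distance window would give finer resolution.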
