An expansive human regulatory lexicon encoded in transcription factor footprints

Regulatory factor binding to genomic DNA protects the underlying sequence from cleavage by DNase I, leaving nucleotide-resolution footprints. Using genomic DNase I footprinting across 41 diverse cell and tissue types, we detected 45 million transcription factor occupancy events within regulatory regions, representing differential binding to 8.4 million distinct short sequence elements. Here we show that this small genomic sequence compartment, roughly twice the size of the exome, encodes an expansive repertoire of conserved recognition sequences for DNA-binding proteins that nearly doubles the size of the human cis–regulatory lexicon. We find that genetic variants affecting allelic chromatin states are concentrated in footprints, and that these elements are preferentially sheltered from DNA methylation. High-resolution DNase I cleavage patterns mirror nucleotide-level evolutionary conservation and track the crystallographic topography of protein–DNA interfaces, indicating that transcription factor structure has been evolutionarily imprinted on the human genome sequence. We identify a stereotyped 50-base-pair footprint that precisely defines the site of transcript origination within thousands of human promoters. Finally, we describe a large collection of novel regulatory factor recognition motifs that are highly conserved in both sequence and function, and exhibit cell-selective occupancy patterns that closely parallel major regulators of development, differentiation and pluripotency.
