High Resolution Genome Wide Binding Event Finding and Motif Discovery Reveals Transcription Factor Spatial Binding Constraints

An essential component of genome function is the syntax of genomic regulatory elements that determine how diverse transcription factors interact to orchestrate a program of regulatory control. A precise characterization of in vivo spacing constraints between key transcription factors would reveal central aspects of this genomic regulatory language. To discover novel transcription factor spatial binding constraints in vivo, we developed a new integrative computational method, Genome wide Event finding and Motif discovery (GEM). GEM resolves ChIP data into explanatory motifs and binding events at high spatial resolution by linking binding event discovery and motif discovery with positional priors in the context of a generative probabilistic model of ChIP data and genome sequence. GEM analysis of 63 transcription factors in 214 ENCODE human ChIP-Seq experiments recovers more known factor motifs than other contemporary methods, and discovers six new motifs for factors with unknown binding specificity. GEM's adaptive learning of binding-event read distributions allows it to further improve upon previous methods for processing ChIP-Seq and ChIP-exo data to yield unsurpassed spatial resolution and discovery of closely spaced binding events of the same factor. In a systematic analysis of in vivo sequence-specific transcription factor binding using GEM, we have found hundreds of spatial binding constraints between factors. GEM found 37 examples of factor binding constraints in mouse ES cells, including strong distance-specific constraints between Klf4 and other key regulatory factors. In human ENCODE data, GEM found 390 examples of spatially constrained pairwise binding, including such novel pairs as c-Fos:c-Jun/USF1, CTCF/Egr1, and HNF4A/FOXA1.
The discovery of new factor-factor spatial constraints in ChIP data is significant because it proposes testable models for regulatory factor interactions that will help elucidate genome function and the implementation of combinatorial control.
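
To make the idea of a distance-specific spatial constraint concrete, the following is a minimal sketch, not GEM's actual procedure or statistical test. It assumes that binding events for two factors have already been called as integer coordinates on a single chromosome, and it simply tallies the signed offsets between nearby events to see whether one spacing dominates. The function names, the 100 bp window, and the toy coordinates are all hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import Counter


def offset_counts(sites_a, sites_b, window=100):
    """Count signed offsets (b - a) for all site pairs at most `window` bp apart.

    Assumes both lists hold positions on the same chromosome (illustrative only).
    """
    sorted_b = sorted(sites_b)
    counts = Counter()
    for a in sites_a:
        # Binary search for the slice of B sites inside [a - window, a + window].
        lo = bisect_left(sorted_b, a - window)
        hi = bisect_right(sorted_b, a + window)
        for b in sorted_b[lo:hi]:
            counts[b - a] += 1
    return counts


def top_offset(counts):
    """Return (offset, count, fraction of all nearby pairs) for the most frequent offset."""
    if not counts:
        return None
    offset, n = counts.most_common(1)[0]
    return offset, n, n / sum(counts.values())


# Toy, made-up coordinates: factor B tends to sit 12 bp downstream of factor A.
sites_a = [1000, 5000, 9000, 14000, 20000]
sites_b = [1012, 5012, 9012, 14040, 19950, 30000]

counts = offset_counts(sites_a, sites_b, window=100)
best = top_offset(counts)
print(best)
```

In a real analysis the offset counts would need to be compared against a background expectation and corrected for multiple testing before any spacing is called a constraint; this sketch only shows the tallying step, and it ignores strand orientation.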
