Signed iterative random forests to identify enhancer-associated transcription factor binding

Standard ChIP-seq peak calling pipelines seek to differentiate biochemically reproducible signals of individual genomic elements from background noise. However, reproducibility alone does not imply functional regulation (e.g., enhancer activation, alternative splicing). Here we present signed iterative random forests (siRF), a general-purpose, interpretable machine learning method, and use it to infer regulatory interactions among transcription factors and functional binding signatures surrounding enhancer elements in Drosophila melanogaster.
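The abstract does not spell out how siRF works, so the following is only a minimal sketch of one idea the name suggests: recording, for each decision path in a tree ensemble, which features were split on and whether the sample fell on the high (+) or low (-) side, then counting signed pairs that co-occur on paths ending in an "active enhancer" leaf. The tree layout, the feature names (Zld, Bcd, Kr), the thresholds, and the sample values are all hypothetical, and the pair counting stands in for whatever interaction search the authors actually use.

```python
from collections import Counter
from itertools import combinations

def signed_path(node, x):
    """Walk one sample down a tree; return (leaf label, signed features seen)."""
    signed = set()
    while "label" not in node:
        feature, threshold = node["feature"], node["threshold"]
        if x[feature] <= threshold:
            signed.add((feature, "-"))   # low side of the split
            node = node["left"]
        else:
            signed.add((feature, "+"))   # high side of the split
            node = node["right"]
    return node["label"], frozenset(signed)

def signed_pair_counts(trees, samples, target=1):
    """Count signed feature pairs co-occurring on paths that end in `target`."""
    counts = Counter()
    for tree in trees:
        for x in samples:
            label, signed = signed_path(tree, x)
            if label == target:
                for pair in combinations(sorted(signed), 2):
                    counts[pair] += 1
    return counts

# Hypothetical toy ensemble: two hand-written trees over binding levels.
tree_a = {"feature": "Zld", "threshold": 0.5,
          "left": {"label": 0},
          "right": {"feature": "Bcd", "threshold": 0.5,
                    "left": {"label": 0}, "right": {"label": 1}}}
tree_b = {"feature": "Bcd", "threshold": 0.5,
          "left": {"label": 0},
          "right": {"feature": "Kr", "threshold": 0.5,
                    "left": {"label": 1}, "right": {"label": 0}}}

# Hypothetical samples: made-up normalized binding signals per region.
samples = [
    {"Zld": 0.9, "Bcd": 0.8, "Kr": 0.1},
    {"Zld": 0.7, "Bcd": 0.9, "Kr": 0.8},
    {"Zld": 0.2, "Bcd": 0.1, "Kr": 0.9},
]

counts = signed_pair_counts([tree_a, tree_b], samples)
for pair, n in counts.most_common():
    print(pair, n)
```

In this toy setting, a pair such as (Bcd+, Kr-) reads as "high Bcd signal together with low Kr signal"; a sign-agnostic interaction would not distinguish that from high Bcd with high Kr.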
