Combining transcription factor binding affinities with open-chromatin data for accurate gene expression prediction

The binding of transcription factors (TFs) and their contribution to cell-type-specific gene expression are often deduced from open-chromatin measurements to avoid costly TF ChIP-seq assays. It is therefore important to develop computational methods that accurately predict TF binding in open-chromatin regions (OCRs). Here, we report a novel segmentation-based method, TEPIC, that predicts TF binding by combining sets of OCRs with position weight matrices. TEPIC can be applied to various types of open-chromatin data, e.g. DNaseI-seq and NOMe-seq. Additionally, histone marks (HMs) can be used to identify candidate TF binding sites. TEPIC computes TF affinities and uses the open-chromatin or HM signal intensity as a quantitative measure of TF binding strength. Using machine learning, we find that including low-affinity binding sites improves our ability to explain gene expression variability compared to the standard presence/absence classification of binding sites. Further, we show that both footprints and peaks capture essential TF binding events and lead to good prediction performance. In our application, gene-based scores computed by TEPIC from a single open-chromatin assay nearly reach the quality obtained with several TF ChIP-seq datasets. Finally, these scores correctly predict known transcriptional regulators, as illustrated by applying TEPIC to novel DNaseI-seq and NOMe-seq data for primary human hepatocytes and CD4+ T-cells, respectively.
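
The contrast between a quantitative affinity and a presence/absence call can be illustrated with a minimal Python sketch. It is not TEPIC's implementation: the scoring rule (a sum of PWM likelihood ratios over all forward-strand windows), the uniform background, the threshold, and the names `region_affinity`, `region_hits` and `gene_score` are all assumptions made for illustration, and the abstract does not specify how regions are assigned to genes.

```python
from math import prod

BACKGROUND = 0.25  # assumption: uniform base frequencies


def window_ratio(window, pwm):
    """Likelihood ratio of one window under the PWM versus the background.

    `pwm` is a list of dicts mapping a base to its probability at that
    position. Unknown bases (e.g. N) are treated as background.
    """
    return prod(col.get(base, BACKGROUND) / BACKGROUND
                for base, col in zip(window, pwm))


def window_ratios(sequence, pwm):
    """Likelihood ratios for every forward-strand window of the region."""
    width = len(pwm)
    return [window_ratio(sequence[i:i + width], pwm)
            for i in range(len(sequence) - width + 1)]


def region_affinity(sequence, pwm):
    """Quantitative score: every window contributes, including weak ones."""
    return sum(window_ratios(sequence, pwm))


def region_hits(sequence, pwm, threshold):
    """Presence/absence style score: count windows at or above a cutoff."""
    return sum(1 for r in window_ratios(sequence, pwm) if r >= threshold)


def gene_score(regions, pwm):
    """Sum region affinities, each weighted by its open-chromatin signal.

    `regions` is a list of (sequence, signal) pairs that are assumed to be
    already assigned to the gene of interest.
    """
    return sum(signal * region_affinity(seq, pwm) for seq, signal in regions)


# Toy two-column motif preferring "AC" (hypothetical values).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]

if __name__ == "__main__":
    print(region_affinity("ACG", TOY_PWM))
    print(region_hits("ACG", TOY_PWM, threshold=5.0))
    print(gene_score([("ACG", 2.0)], TOY_PWM))
```

For the toy region `ACG`, the strong window `AC` and the weak window `CG` both contribute to the affinity, whereas the thresholded count keeps only the strong one. Per-gene scores of this kind, one per TF, could then serve as features for a regression model of gene expression.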
