Anchor: trans-cell type prediction of transcription factor binding sites

The ENCyclopedia of DNA Elements (ENCODE) consortium has generated transcription factor (TF) binding ChIP-seq data covering hundreds of TF proteins and cell types; however, due to limits on time and resources, only a small fraction of all possible TF-cell type pairs has been profiled. One solution is to train machine learning models on the currently available epigenomic data sets and apply them to the missing pairs. A major challenge is that TF binding sites are cell-type-specific, which can be attributed to cellular context such as chromatin accessibility. In addition, indirect TF-DNA binding and interactions between TFs complicate this regulatory process, and technical issues such as sequencing biases and batch effects make the prediction task even more challenging. Many pioneering efforts have been made to predict TF binding profiles from DNA sequence and DNase-seq footprints, but the extent to which a model can generalize to completely untested cell conditions remains unknown. In this study, we describe our first-place solution to the 2017 ENCODE-DREAM in vivo TF binding site prediction challenge. By carefully addressing multisource biases and information imbalance across cell types, we created a pipeline that significantly outperforms the current state-of-the-art methods. The proposed method is complex enough to model nonlinear interactions between TF binding motifs and chromatin accessibility information up to 1500 bp from the genomic region of interest.
