Imputation for transcription factor binding predictions based on deep learning

Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, and ChIP-seq, widely considered the gold standard for this task, provides valuable data. Despite tremendous efforts by the scientific community to conduct TF ChIP-seq experiments, the available data cover only a small percentage of all possible combinations of TFs and cell lines. In this study, we demonstrate a method that accurately predicts cell-specific TF binding for TF-cell line combinations using the ChIP-seq data available for only a small fraction (4%) of those combinations. The proposed model, termed TFImpute, is a deep neural network trained in a multi-task learning setting to borrow information across TFs and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data and better accuracy on combinations without ChIP-seq data. The approach can also predict cell-line-specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and the impact of SNPs on TF binding.
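To make the idea of borrowing information across TFs and cell lines concrete, the following is a minimal sketch of one way a single network can share parameters across all TF-cell line combinations. It is not the TFImpute architecture, which the abstract does not specify. The learned embeddings, the gating step, all sizes, and every name (`tf_emb`, `cell_emb`, `predict_binding`, and so on) are assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only to keep the example small.
N_TFS, N_CELLS, EMB_DIM = 5, 4, 8
N_FILTERS, FILTER_LEN = 6, 5

# Parameters shared by every TF-cell line combination (assumed design):
# one embedding vector per TF, one per cell line, and a common set of
# sequence filters that act as motif detectors.
tf_emb = rng.normal(scale=0.1, size=(N_TFS, EMB_DIM))
cell_emb = rng.normal(scale=0.1, size=(N_CELLS, EMB_DIM))
filters = rng.normal(scale=0.1, size=(N_FILTERS, FILTER_LEN, 4))
w_gate = rng.normal(scale=0.1, size=(2 * EMB_DIM, N_FILTERS))
w_out = rng.normal(scale=0.1, size=N_FILTERS)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def one_hot(seq):
    """Encode a DNA string as a (length, 4) matrix; unknown bases stay all-zero."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((len(seq), 4))
    for i, base in enumerate(seq.upper()):
        if base in index:
            x[i, index[base]] = 1.0
    return x


def predict_binding(seq, tf_id, cell_id):
    """Return a binding probability for one (sequence, TF, cell line) triple."""
    x = one_hot(seq)
    # Slide every filter along the sequence: windows has shape (positions, FILTER_LEN, 4).
    windows = np.stack([x[i:i + FILTER_LEN] for i in range(len(x) - FILTER_LEN + 1)])
    conv = np.einsum("plc,flc->pf", windows, filters)
    # ReLU, then max over positions: one activation per filter.
    pooled = np.maximum(conv, 0.0).max(axis=0)
    # The TF and cell line embeddings decide which filters matter for this combination.
    context = np.concatenate([tf_emb[tf_id], cell_emb[cell_id]])
    gate = sigmoid(context @ w_gate)
    return float(sigmoid((pooled * gate) @ w_out))


# Any TF can be paired with any cell line, including pairs never seen together in training.
p = predict_binding("ACGTACGTACGT", tf_id=4, cell_id=3)
```

Because each TF and each cell line has its own embedding while the filters and output weights are common to all of them, training on the observed combinations would update parameters that are reused when scoring an unobserved combination. That is the general mechanism by which a multi-task model of this kind can impute missing TF-cell line pairs.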
