A machine learning framework for the prediction of chromatin folding in Drosophila using epigenetic features

Technological advances have led to the creation of large epigenetic datasets, including information about DNA-binding proteins and DNA spatial structure. Hi-C experiments have revealed that chromosomes are subdivided into sets of self-interacting domains called Topologically Associating Domains (TADs). TADs are involved in the regulation of gene expression activity, but the mechanisms of their formation are not yet fully understood. Here, we focus on machine learning methods to characterize DNA folding patterns in Drosophila based on chromatin marks across three cell lines. We present linear regression models with four types of regularization, gradient boosting, and recurrent neural networks (RNN) as tools to study chromatin folding characteristics associated with TADs given epigenetic chromatin immunoprecipitation data. The bidirectional long short-term memory RNN architecture produced the best prediction scores and identified biologically relevant features. The distributions of the protein Chriz (Chromator) and the histone modification H3K4me3 were selected as the most informative features for the prediction of TAD characteristics. This approach may be adapted to any similar biological dataset of chromatin features across various cell lines and species. The code for the implemented pipeline, Hi-ChiP-ML, is publicly available at https://github.com/MichalRozenwald/Hi-ChIP-ML

Subjects: Bioinformatics, Computational Biology, Molecular Biology, Data Mining and Machine Learning, Data Science
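
The abstract mentions linear regression with four types of regularization without listing them. The following is a minimal sketch, assuming the four variants are the common choices (no penalty, L1, L2, and L1 combined with L2). It uses synthetic data only: the three columns merely stand in for per-bin chromatin signals and are not taken from the study. The helper names `soft_threshold` and `fit_linear` are hypothetical and are not part of the Hi-ChIP-ML pipeline.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the L1 proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_linear(X, y, l1=0.0, l2=0.0, n_iter=200):
    """Coordinate descent, without intercept, for the objective
    (1/2n) * ||y - Xw||^2 + l1 * ||w||_1 + (l2/2) * ||w||^2."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * w[j]) for i in range(n)) / n
            new_w = soft_threshold(rho, l1) / (col_sq[j] + l2)
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                w[j] = new_w
    return w


# Synthetic stand-in data (an assumption, not the study's data):
# 200 "genomic bins", 3 signal columns, target driven by the first two only.
rng = random.Random(0)
n_bins = 200
X = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(n_bins)]
y = [2.0 * row[0] + 1.0 * row[1] + rng.gauss(0.0, 0.1) for row in X]

models = {
    "none": fit_linear(X, y),
    "l1": fit_linear(X, y, l1=0.5),
    "l2": fit_linear(X, y, l2=1.0),
    "l1+l2": fit_linear(X, y, l1=0.25, l2=0.5),
}

for name, weights in models.items():
    print(name, [round(w, 3) for w in weights])
```

In this toy setting the L1 penalty drives the weight of the uninformative third column to exactly zero, while the L2 penalty shrinks all weights without zeroing them. That difference is why sparse penalties are often used when the goal is to rank input features by importance.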


[32]  Yijun Ruan,et al.  Evolutionarily Conserved Principles Predict 3D Chromatin Organization. , 2017, Molecular cell.

[33]  Jesse R. Dixon,et al.  Topological Domains in Mammalian Genomes Identified by Analysis of Chromatin Interactions , 2012, Nature.

[34]  S. Mundlos,et al.  Breaking TADs: How Alterations of Chromatin Domains Result in Disease. , 2016, Trends in genetics : TIG.

[35]  M. Gerstein,et al.  Unlocking the secrets of the genome , 2009, Nature.

[36]  Ekta Khurana,et al.  DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure , 2020, Genome Biology.

[37]  Ryan A. Flynn,et al.  A unique chromatin signature uncovers early developmental enhancers in humans , 2011, Nature.

[38]  John Lygeros,et al.  Inference of the three-dimensional chromatin structure and its temporal behavior , 2018, ArXiv.

[39]  Alex Graves,et al.  Supervised Sequence Labelling , 2012 .

[40]  Kin Chung Lam,et al.  High-resolution TADs reveal DNA sequences underlying genome organization in flies , 2017, Nature Communications.

[41]  Aristotelis Tsirigos,et al.  Stratification of TAD boundaries reveals preferential insulation of super-enhancers by strong boundaries , 2018, Nature Communications.

[42]  V. Babenko,et al.  Genetic Organization of Interphase Chromosome Bands and Interbands in Drosophila melanogaster , 2014, PloS one.

[43]  Xin Yan,et al.  Linear Regression Analysis: Theory and Computing , 2009 .

[44]  William Stafford Noble,et al.  Nucleotide sequence and DNaseI sensitivity are predictive of 3D chromatin architecture , 2017, bioRxiv.

[45]  S. Vavrus,et al.  The influence of Arctic amplification on mid-latitude summer circulation , 2018, Nature Communications.

[46]  K. Zhao,et al.  Characterization of genome-wide enhancer-promoter interactions reveals co-expression of interacting genes and modes of higher order chromatin organization , 2012, Cell Research.

[47]  Peter H. L. Krijger,et al.  Regulation of disease-associated gene expression in the 3D genome , 2016, Nature Reviews Molecular Cell Biology.

[48]  Giovanni Bosco,et al.  Condensin II Counteracts Cohesin and RNA Polymerase II in the Establishment of 3D Chromatin Organization. , 2019, Cell reports.

[49]  P. Belokopytova,et al.  Quantitative prediction of enhancer–promoter interactions , 2019, bioRxiv.

[50]  Lovelace J. Luquette,et al.  Comprehensive analysis of the chromatin landscape in Drosophila , 2010, Nature.

[51]  Dariusz M Plewczynski,et al.  Three-dimensional Epigenome Statistical Model: Genome-wide Chromatin Looping Prediction , 2018, Scientific Reports.

[52]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[53]  W. Wong,et al.  DeepTACT: predicting 3D chromatin contacts via bootstrapping deep learning , 2019, Nucleic acids research.

[54]  M. Andrade-Navarro,et al.  7C: Computational Chromosome Conformation Capture by Correlation of ChIP-seq at CTCF motifs , 2019, BMC Genomics.

[55]  Fatima Zare,et al.  Noise cancellation using total variation for copy number variation detection , 2018, BMC Bioinformatics.

[56]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[57]  Guillaume J. Filion,et al.  Systematic Protein Location Mapping Reveals Five Principal Chromatin Types in Drosophila Cells , 2010, Cell.

[58]  R. Jiang,et al.  Prediction of enhancer-promoter interactions via natural language processing , 2018, BMC Genomics.

[59]  K. Pollard,et al.  Enhancer–promoter interactions are encoded by complex genomic signatures on looping chromatin , 2016, Nature Genetics.

[60]  A. Mortazavi,et al.  Genome-Wide Mapping of in Vivo Protein-DNA Interactions , 2007, Science.

[61]  Javier Quilez,et al.  Transcription factors orchestrate dynamic interplay between genome topology and gene regulation during cell reprogramming , 2017, Nature Genetics.

[62]  Nicolae Radu Zabet,et al.  Chromatin architecture reorganization during neuronal cell differentiation in Drosophila genome. , 2019, Genome research.

[63]  Ricardo J. Miragaia,et al.  scRNA-seq assessment of the human lung, spleen, and esophagus tissue stability after cold preservation , 2019, Genome Biology.

[64]  Hairong Lv,et al.  hicGAN infers super resolution Hi-C data with generative adversarial networks , 2019, Bioinform..

[65]  Zhaohui S. Qin,et al.  Gene density, transcription, and insulators contribute to the partition of the Drosophila genome into physical domains. , 2012, Molecular cell.