A Comparative Study of Supervised Machine Learning Algorithms for the Prediction of Long-Range Chromatin Interactions

The role of three-dimensional genome organization as a critical regulator of gene expression has become increasingly clear over the last decade. Most of our understanding of this association comes from long-range chromatin interaction maps produced by Chromatin Conformation Capture-based techniques, which have improved greatly in recent years. Because these procedures are experimentally laborious and expensive, in silico prediction has emerged as an alternative strategy to generate virtual maps for cell types and conditions in which experimental chromatin interaction data are not available. Several methods rely on predictive models trained on one-dimensional (1D) sequencing features and have yielded promising results. However, these approaches differ both in how they model chromatin interactions and in the machine learning strategy they rely on, which makes it difficult to compare the performance of existing methods. In this study, we use publicly available 1D sequencing signals to model chromatin interactions in two human cell lines and evaluate the prediction performance of five popular machine learning algorithms: decision trees, random forests, gradient boosting, support vector machines and multi-layer perceptrons. Our approach accurately predicts long-range interactions and reveals that gradient boosting significantly outperforms the other four algorithms, yielding accuracies of ~95%. We show that chromatin features in close genomic proximity to the anchors carry most of the predictive information. Moreover, we demonstrate that gradient boosting models, unlike the other methods tested, still produce accurate predictions when trained on different subsets of chromatin features. In this regard, transcription factors, in addition to architectural proteins, are shown to be highly informative.
Our study provides a framework for the systematic prediction of long-range chromatin interactions, identifies gradient boosting as the best-suited algorithm for this task and highlights cell-type-specific binding of transcription factors at the anchors as an important determinant of chromatin wiring.
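
The comparison described above can be illustrated with a minimal sketch, which is not the pipeline used in the study. It assumes scikit-learn is available, uses a synthetic feature matrix from `make_classification` in place of real 1D sequencing signals at interaction anchors, and substitutes scikit-learn's `GradientBoostingClassifier` for whichever gradient boosting implementation the study used. The helper `compare_models`, the hyperparameters and the five-fold cross-validation scheme are illustrative assumptions, so the resulting scores say nothing about the reported accuracies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real feature matrix (assumption): each row plays
# the role of a candidate anchor pair, each column a 1D signal near an anchor.
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, random_state=0
)

# The five algorithm families named in the abstract, with illustrative settings.
# The SVM and the perceptron are scale-sensitive, so they get a scaling step.
models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(random_state=0),
    "support vector machine": make_pipeline(StandardScaler(), SVC()),
    "multi-layer perceptron": make_pipeline(
        StandardScaler(),
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
    ),
}


def compare_models(models, X, y, folds=5):
    """Return the mean cross-validated accuracy of each model (hypothetical helper)."""
    return {
        name: cross_val_score(model, X, y, cv=folds, scoring="accuracy").mean()
        for name, model in models.items()
    }


scores = compare_models(models, X, y)
for name, accuracy in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:>24}: {accuracy:.3f}")
```

Evaluating every model on the same cross-validation folds and the same feature matrix keeps the comparison like-for-like, which is the difficulty the abstract raises about comparing existing methods.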
