Predicting CTCF-mediated chromatin loops using CTCF-MP

Motivation: The three-dimensional organization of chromosomes within the cell nucleus is highly regulated. CCCTC-binding factor (CTCF) is an important architectural protein that mediates long-range chromatin loops. Recent studies have shown that the majority of CTCF binding motif pairs at chromatin loop anchor regions are in convergent orientation. However, it remains unknown whether the genomic context at the sequence level can determine if a convergent CTCF motif pair is able to form a chromatin loop.

Results: In this article, we directly ask whether, and which, sequence-based features (other than the motif itself) may be important for establishing CTCF-mediated chromatin loops. We found that motif conservation measured by 'branch-of-origin', which accounts for motif turnover in evolution, is an important feature. We developed a new machine learning algorithm called CTCF-MP, based on word2vec, to demonstrate that sequence-based features alone can predict whether a pair of convergent CTCF motifs will form a loop. Together with functional genomic signals from CTCF ChIP-seq and DNase-seq, CTCF-MP makes highly accurate predictions of whether a convergent CTCF motif pair will form a loop, both within a single cell type and across different cell types. Our work represents an important step toward understanding the sequence determinants that may guide the formation of complex chromatin architectures.

Availability and implementation: The source code of CTCF-MP can be accessed at https://github.com/ma-compbio/CTCF-MP
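The two ideas the abstract builds on, convergent motif orientation and word2vec-style "words" derived from DNA sequence, can be illustrated with a minimal sketch. This is not the CTCF-MP implementation: the `Motif` record, the `max_distance` cutoff, and the k-mer length and stride are hypothetical choices made only for illustration. Convergent is taken here to mean that the upstream motif lies on the forward strand and the downstream motif on the reverse strand.

```python
from typing import List, NamedTuple, Tuple


class Motif(NamedTuple):
    """Hypothetical record for one CTCF motif occurrence."""
    chrom: str
    start: int
    strand: str  # "+" (forward) or "-" (reverse)


def is_convergent(a: Motif, b: Motif) -> bool:
    """True if both motifs share a chromosome and point toward each other."""
    if a.chrom != b.chrom or a.start == b.start:
        return False
    upstream, downstream = sorted((a, b), key=lambda m: m.start)
    return upstream.strand == "+" and downstream.strand == "-"


def convergent_pairs(motifs: List[Motif], max_distance: int) -> List[Tuple[Motif, Motif]]:
    """Enumerate convergent pairs no farther apart than max_distance (assumed cutoff)."""
    ordered = sorted(motifs, key=lambda m: (m.chrom, m.start))
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:]:
            # Later entries are on another chromosome or even farther away.
            if right.chrom != left.chrom or right.start - left.start > max_distance:
                break
            if is_convergent(left, right):
                pairs.append((left, right))
    return pairs


def kmer_tokens(sequence: str, k: int = 6, stride: int = 1) -> List[str]:
    """Split a DNA sequence into overlapping k-mers, the 'words' an embedding model would see."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


if __name__ == "__main__":
    example = [
        Motif("chr1", 100, "+"),
        Motif("chr1", 500, "-"),
        Motif("chr1", 900, "+"),
        Motif("chr2", 200, "-"),
    ]
    print(convergent_pairs(example, max_distance=1000))
    print(kmer_tokens("ACGTAC", k=3))
```

In a setup like this, the k-mer lists from the sequence around each motif would serve as the "sentences" passed to a word2vec-style model, and the resulting embeddings would become features for a classifier over the candidate pairs.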
