Exploiting sequence-based features for predicting enhancer–promoter interactions

Motivation: A large number of distal enhancers and proximal promoters form enhancer‐promoter interactions to regulate target genes in the human genome. Although recent high‐throughput genome‐wide mapping approaches have allowed us to more comprehensively recognize potential enhancer‐promoter interactions, it is still largely unknown whether sequence‐based features alone are sufficient to predict such interactions.

Results: Here, we develop a new computational method (named PEP) to predict enhancer‐promoter interactions based on sequence‐based features only, when the locations of putative enhancers and promoters in a particular cell type are given. The two modules in PEP (PEP‐Motif and PEP‐Word) use different but complementary feature extraction strategies to exploit sequence‐based information. The results across six different cell types demonstrate that our method is effective in predicting enhancer‐promoter interactions as compared to the state‐of‐the‐art methods that use functional genomic signals. Our work demonstrates that sequence‐based features alone can reliably predict enhancer‐promoter interactions genome‐wide, which could potentially facilitate the discovery of important sequence determinants for long‐range gene regulation.

Availability and Implementation: The source code of PEP is available at: https://github.com/ma-compbio/PEP.

Contact: jianma@cs.cmu.edu

Supplementary information: Supplementary data are available at Bioinformatics online.
