DELTA: A Distal Enhancer Locating Tool Based on AdaBoost Algorithm and Shape Features of Chromatin Modifications

Accurate identification of DNA regulatory elements has become an urgent need in the post-genomic era. Recent genome-wide chromatin-state mapping efforts have revealed that DNA elements are associated with characteristic chromatin modification signatures, and several approaches have been developed to predict transcriptional enhancers from these signatures. However, their practical application is limited by incomplete extraction of chromatin features and by model inconsistency when predicting enhancers across different cell types. To address these issues, we define a set of non-redundant shape features of histone modifications, which show high consistency across cell types and greatly reduce the dimensionality of the feature vectors. By integrating these shape features with the AdaBoost machine-learning algorithm, we developed an enhancer prediction method, DELTA (Distal Enhancer Locating Tool based on AdaBoost). We show that DELTA significantly outperforms current enhancer prediction methods in prediction accuracy on different datasets, and that it can predict enhancers in one cell type using models trained in other cell types without loss of accuracy. Overall, our study presents a novel framework for accurately identifying enhancers from epigenetic data across multiple cell types.
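The general idea of pairing shape features with boosting can be illustrated with a minimal sketch. This is not the DELTA implementation. The three descriptors in `shape_features` (mean level, skewness, and a central dip score), the toy signal profiles, and all function names are hypothetical and chosen only for illustration. The classifier is a plain discrete AdaBoost over threshold stumps, written in pure Python.

```python
import math

def shape_features(profile):
    """Toy shape descriptors of one histone-mark profile (hypothetical, not DELTA's set)."""
    n = len(profile)
    mean = sum(profile) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in profile) / n)
    # Skewness: asymmetry of the signal values around their mean.
    skew = sum((x - mean) ** 3 for x in profile) / (n * std ** 3) if std > 0 else 0.0
    # Central dip: how far the middle bin falls below the highest bin.
    peak = max(profile)
    dip = (peak - profile[n // 2]) / peak if peak > 0 else 0.0
    return [mean, skew, dip]

def stump_predict(stump, row):
    """A stump votes `sign` when feature j exceeds threshold t, otherwise -sign."""
    _, j, t, sign = stump
    return sign if row[j] > t else -sign

def train_stump(X, y, w):
    """Exhaustively pick the (feature, threshold, sign) with lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                cand = (0.0, j, t, sign)
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(cand, row) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, rounds=10):
    """Discrete AdaBoost with labels in {-1, +1}; returns a list of (alpha, stump)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, stump))
        if stump[0] < 1e-10:  # a perfect stump: nothing left to reweight
            break
        # Increase the weight of misclassified examples, then renormalise.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    return model

def predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in model)
    return 1 if score >= 0 else -1

# Made-up binned signals: enhancer-like profiles are strong with a central dip,
# background profiles are weak and flat.
enhancer_like = [[1, 4, 8, 3, 8, 4, 1], [0, 5, 9, 2, 9, 5, 0], [1, 3, 7, 2, 7, 3, 1]]
background = [[2, 2, 3, 2, 2, 3, 2], [1, 1, 1, 2, 1, 1, 1], [3, 2, 2, 2, 2, 2, 3]]

X = [shape_features(p) for p in enhancer_like + background]
y = [1] * len(enhancer_like) + [-1] * len(background)
model = adaboost(X, y)
train_predictions = [predict(model, row) for row in X]
```

Because each profile is reduced to a short vector of shape descriptors rather than raw read counts, a model trained on profiles from one cell type could in principle be applied to profiles from another by calling `predict(model, shape_features(new_profile))`. Whether that transfer holds in practice depends on the features and data actually used.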
