Accurate Promoter and Enhancer Identification in 127 ENCODE and Roadmap Epigenomics Cell Types and Tissues by GenoSTAN

Accurate maps of promoters and enhancers are required for understanding transcriptional regulation. Promoters and enhancers are usually mapped by integration of chromatin assays charting histone modifications, DNA accessibility, and transcription factor binding. However, current algorithms are limited by unrealistic data distribution assumptions. Here we propose GenoSTAN (Genomic STate ANnotation), a hidden Markov model overcoming these limitations. We map promoters and enhancers for 127 cell types and tissues from the ENCODE and Roadmap Epigenomics projects, today’s largest compendium of chromatin assays. Extensive benchmarks demonstrate that GenoSTAN generally identifies promoters and enhancers with significantly higher accuracy than previous methods. Moreover, GenoSTAN-derived promoters and enhancers showed significantly higher enrichment of complex trait-associated genetic variants than current annotations. Altogether, GenoSTAN provides an easy-to-use tool to define promoters and enhancers in any system, and our annotation of human transcriptional cis-regulatory elements constitutes a rich resource for future research in biology and medicine.
