An Annotation Agnostic Algorithm for Detecting Nascent RNA Transcripts in GRO-Seq

We present a fast and simple algorithm for detecting nascent RNA transcription in global nuclear run-on sequencing (GRO-seq) data. GRO-seq is a relatively new protocol that captures nascent transcripts from actively engaged polymerase, providing a direct readout of bona fide transcription. By contrast, most traditional assays, such as RNA-seq, measure steady-state RNA levels, which are affected by transcription, post-transcriptional processing, and RNA stability. GRO-seq data, however, present unique analysis challenges that are only beginning to be addressed. Here, we describe a new algorithm, Fast Read Stitcher (FStitch), that combines two popular machine-learning techniques, hidden Markov models and logistic regression, to classify which regions of the genome are transcribed. Given a small user-defined training set, our algorithm is accurate, robust to varying read depth, annotation agnostic, and fast. Analyzing GRO-seq data without requiring a priori annotation uncovers surprising new insights into several aspects of the transcription process.
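
To illustrate how a logistic-regression classifier and a hidden Markov model can be combined to label transcribed regions, the following is a minimal sketch and not the published FStitch implementation. It assumes the genome has been cut into fixed-width bins, each summarized by a single feature (for example, the log of its read count), and that a handful of bins have been labeled by hand as transcribed (1) or not (0). The function names, the single-feature model, and the `stay` self-transition probability are all illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit p(transcribed | x) = sigmoid(w*x + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y   # gradient of the log loss
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def viterbi_two_state(p_active, stay=0.9):
    """Most likely 0/1 path, using classifier probabilities as emission scores.

    `stay` is an assumed self-transition probability shared by both states;
    a high value discourages switching and so stitches over short gaps.
    """
    eps = 1e-9
    log_stay, log_switch = math.log(stay), math.log(1.0 - stay)
    # score[s] = best log score of any path ending in state s (uniform start)
    score = [math.log(1.0 - p_active[0] + eps), math.log(p_active[0] + eps)]
    back = []
    for p in p_active[1:]:
        emit = [math.log(1.0 - p + eps), math.log(p + eps)]
        new_score, ptr = [], []
        for s in (0, 1):
            cands = [score[r] + (log_stay if r == s else log_switch)
                     for r in (0, 1)]
            best = 0 if cands[0] >= cands[1] else 1
            ptr.append(best)
            new_score.append(cands[best] + emit[s])
        back.append(ptr)
        score = new_score
    # Trace back from the best final state
    state = 0 if score[0] >= score[1] else 1
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def call_transcribed(features, w, b, stay=0.9):
    """Score each bin with the classifier, then smooth the calls with Viterbi."""
    probs = [sigmoid(w * x + b) for x in features]
    return viterbi_two_state(probs, stay)

# Hypothetical training set: per-bin feature values with hand-assigned labels.
w, b = fit_logistic([0, 0, 1, 4, 5, 6], [0, 0, 0, 1, 1, 1])
calls = call_transcribed([0, 0, 5, 6, 5, 0, 0], w, b)
```

In this sketch the classifier decides how strongly each bin looks transcribed, while the transition penalty decides how readily adjacent bins are merged into one contiguous call; contiguous runs of 1 in `calls` would correspond to candidate transcribed regions.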
