Flexible expressed region analysis for RNA-seq with derfinder

Background: Differential expression analysis of RNA sequencing (RNA-seq) data typically relies on reconstructing transcripts or counting reads that overlap known gene structures. We previously introduced an intermediate statistical approach called differentially expressed region (DER) finder, which seeks to identify contiguous regions of the genome showing differential expression signal at single-base resolution without relying on existing annotation or potentially inaccurate transcript assembly.

Results: We present the derfinder software, which improves our annotation-agnostic approach to RNA-seq analysis by (1) implementing a computationally efficient bump-hunting approach to identify DERs, which permits genome-scale analyses in a large number of samples, (2) introducing a flexible statistical modeling framework, including multi-group and time-course analyses, and (3) introducing a new set of data visualizations for expressed region analysis. We apply this approach to public RNA-seq data from the Genotype-Tissue Expression (GTEx) project and the BrainSpan project to show that derfinder permits the analysis of hundreds of samples at base resolution in R, identifies expression outside of known gene boundaries, and can be used to visualize expressed regions at base resolution. In simulations, our base-resolution approaches enable discovery in the presence of incomplete annotation and are nearly as powerful as feature-level methods when the annotation is complete.

Conclusions: derfinder analysis using expressed region-level and single base-level approaches provides a compromise between full transcript reconstruction and feature-level analysis. The package is available from Bioconductor at www.bioconductor.org/packages/derfinder.
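The idea of an expressed region, a contiguous stretch of bases with expression signal found without reference to annotation, can be illustrated with a small sketch. The code below is not derfinder's implementation or API (derfinder is an R package); it is a simplified Python illustration that assumes per-base coverage is already available for each sample and that a region is any maximal run of bases whose mean coverage across samples exceeds a cutoff. The function name `find_expressed_regions`, the cutoff value, and the toy coverage numbers are all hypothetical.

```python
from typing import List, Tuple


def find_expressed_regions(
    mean_coverage: List[float], cutoff: float
) -> List[Tuple[int, int]]:
    """Return (start, end) pairs, 0-based and end-exclusive, for each
    maximal run of bases whose mean coverage is strictly above `cutoff`.

    Hypothetical helper for illustration only; not part of derfinder.
    """
    regions = []
    start = None  # start of the run currently being extended, if any
    for pos, cov in enumerate(mean_coverage):
        if cov > cutoff:
            if start is None:
                start = pos  # a new run begins at this base
        elif start is not None:
            regions.append((start, pos))  # the run ended at the previous base
            start = None
    if start is not None:
        # The last run extends to the end of the coverage vector.
        regions.append((start, len(mean_coverage)))
    return regions


# Toy per-base coverage for 3 samples over 10 bases (made-up numbers).
samples = [
    [0, 0, 5, 6, 7, 0, 0, 3, 4, 0],
    [0, 1, 4, 6, 8, 1, 0, 2, 5, 0],
    [0, 0, 6, 5, 9, 0, 0, 4, 3, 0],
]

# Average across samples at each base, then segment at an assumed cutoff.
mean_coverage = [sum(col) / len(col) for col in zip(*samples)]
regions = find_expressed_regions(mean_coverage, cutoff=2.0)
print(regions)
```

In a real analysis, the regions found this way would then be summarized per sample and tested for differential expression; the sketch stops at the segmentation step.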
