recount: A large-scale resource of analysis-ready RNA-seq expression data

recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and the corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence, including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis, including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets for drawing additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.
