recount-brain: a curated repository of human brain RNA-seq datasets metadata

The usability of publicly available gene expression data is often limited by the availability of high-quality, standardized biological phenotype and experimental condition information (“metadata”). We released the recount2 project, which involved re-processing ∼70,000 samples from the Sequence Read Archive (SRA), Genotype-Tissue Expression (GTEx), and The Cancer Genome Atlas (TCGA) projects. While samples from the latter two projects are well characterized with extensive metadata, the ∼50,000 RNA sequencing (RNA-seq) samples from SRA in recount2 are inconsistently annotated. Tissue type, sex, and library type can be estimated from the RNA-seq data itself. However, more detailed and harder-to-predict metadata, such as age and diagnosis, should ideally be provided by the labs that deposit the data. To facilitate more analyses of human brain tissue data, we have complemented phenotype predictions by manually constructing recount-brain, a uniformly curated database of public human brain RNA-seq samples present in SRA and recount2. We describe the reproducible curation process used to construct recount-brain, which involves a systematic review of the primary manuscript and can serve as a guide for annotating other studies and tissues. We further expanded recount-brain by merging it with GTEx and TCGA brain samples and by linking to controlled vocabulary terms for tissue, Brodmann area, and disease. Furthermore, we illustrate how to integrate the sample metadata in recount-brain with the gene expression data in recount2 to perform differential expression analysis. We then provide three analysis examples involving modeling of postmortem interval, glioblastoma, and meta-analyses across GTEx and TCGA. Overall, recount-brain facilitates expression analyses and improves their reproducibility, because individual researchers do not have to manually curate the sample metadata.
recount-brain is available via the add_metadata() function from the recount Bioconductor package at bioconductor.org/packages/recount.
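
As a minimal sketch of how the metadata can be attached to recount2 expression data, the R snippet below downloads one study and appends the recount-brain columns to its sample information. It assumes the recount package is installed and a network connection is available. The study accession `SRP027383` and the `source = "recount_brain_v2"` value are illustrative choices made here, not details stated in the text above; check the package help page for `add_metadata()` for the options supported by your installed version.

```r
## Assumes the recount Bioconductor package is installed
## and that an internet connection is available.
library("recount")

## Study accession chosen for illustration only (an assumption,
## not taken from the abstract); any SRA study in recount2 could be used.
study <- "SRP027383"

## Download the gene-level RangedSummarizedExperiment unless it is
## already on disk. download_study() saves it under a folder named
## after the study accession.
rse_file <- file.path(study, "rse_gene.Rdata")
if (!file.exists(rse_file)) {
    download_study(study)
}
load(rse_file) # creates the object `rse_gene`

## Append the recount-brain columns to the sample metadata (colData).
## The `source` value is an assumption about which release to use.
rse_gene <- add_metadata(rse_gene, source = "recount_brain_v2")

## Inspect the available sample-level variables.
colnames(colData(rse_gene))
```

The resulting object keeps the expression counts and the curated sample metadata together, so variables added by recount-brain can be used directly when building a design matrix for a differential expression analysis.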
