GREIN: An Interactive Web Platform for Re-analyzing GEO RNA-seq Data

The vast amount of RNA-seq data deposited in the Gene Expression Omnibus (GEO) and the Sequence Read Archive (SRA) remains a grossly underutilized resource for biomedical research. To remove the technical roadblocks to reusing these data, we have developed GREIN (GEO RNA-seq Experiments Interactive Navigator), a web application that provides user-friendly interfaces for manipulating and analyzing GEO RNA-seq data. GREIN is powered by a back-end computational pipeline for uniform processing of RNA-seq data and by a large collection (>6,500) of already processed datasets. The front-end user interfaces provide a wealth of analysis options, including subsetting and downloading processed data, interactive visualization, statistical power analyses, construction of differential gene expression signatures and their comprehensive functional characterization, and connectivity analysis with LINCS L1000 data. The combination of the massive amount of back-end data and the front-end analytics options, driven by user-friendly interfaces, makes GREIN a unique open-source resource for reusing GEO RNA-seq data. GREIN is accessible at https://shiny.ilincs.org/grein, the source code at https://github.com/uc-bd2k/grein, and the Docker container at https://hub.docker.com/r/ucbd2k/grein.
