How data analysis affects power, reproducibility and biological insight of RNA-seq studies in complex datasets

Sequencing of the full transcriptome (RNA-seq) has become the preferred method for measuring genome-wide gene expression. Despite its widespread use, challenges remain in RNA-seq data analysis. One often-overlooked aspect is normalization. Although a variety of factors, or ‘batch effects’, can contribute unwanted variation to the data, commonly used RNA-seq normalization methods correct only for sequencing depth. The study of gene expression is particularly problematic when expression is influenced simultaneously by several biological factors in addition to the one of interest. Using examples from experimental neuroscience, we show that batch effects can dominate the signal of interest and that the choice of normalization method affects the power and reproducibility of the results. While commonly used global normalization methods cannot adequately normalize the data, more recently developed RNA-seq normalization methods can. We focus on one such method, RUVSeq, and show that it increases the power and biological insight of the results. Finally, we provide a tutorial outlining the implementation of RUVSeq normalization that is applicable to a broad range of studies as well as to meta-analyses of publicly available data.
