Practical impacts of genomic data “cleaning” on biological discovery using surrogate variable analysis

Background: Genomic data production is at its highest level and continues to increase, making both novel primary data and existing public data available to researchers for exploration. Here we explore the consequences of “batch” correction for biological discovery in two publicly available expression datasets. We take “batch” correction to include the estimation of, and adjustment for, widespread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it is technical or biological in nature.

Methods: We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects. The analyses use a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272).

Results: Careful specification of the biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the “cleaned” data, including sex, common copy number effects, and sample- or cell line-specific molecular behavior.

Conclusions: Our analyses indicate that data “cleaning” can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest.
It is also important to note that open data exploration is not possible after such supervised “cleaning”, because effects beyond those stipulated by the researcher may have been removed. With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying artifacts and guiding the “cleaning” process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain dataset at http://braincloud.jhmi.edu/plots/ and under GSE30272.
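
The trade-off described above, where the protected effect survives while unspecified effects such as sex are removed, can be illustrated with a small simulation. The sketch below is written in Python rather than the authors' R code, and it is a deliberately simplified single-factor version of the idea, not the published SVA algorithm (which is more elaborate and, among other things, weights genes iteratively). All names (`residualize_on_group`, `top_sample_pattern`), the simulated effect sizes, and the balanced two-group design are assumptions made for illustration only.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def avg(xs):
    return sum(xs) / len(xs)

def mean_diff(row, labels):
    """Mean of the samples labelled 1 minus mean of the samples labelled 0."""
    ones = [x for x, l in zip(row, labels) if l == 1]
    zeros = [x for x, l in zip(row, labels) if l == 0]
    return avg(ones) - avg(zeros)

def residualize_on_group(row, group):
    """Remove the protected effect by subtracting each group's mean."""
    means = {g: avg([x for x, gi in zip(row, group) if gi == g])
             for g in set(group)}
    return [x - means[gi] for x, gi in zip(row, group)]

def top_sample_pattern(R, n_iter=50, seed=0):
    """Power iteration for the dominant sample-space singular vector of R
    (rows are genes, columns are samples)."""
    rng = random.Random(seed)
    n = len(R[0])
    u = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(n_iter):
        w = [dot(row, u) for row in R]                      # R u
        u = [sum(w[g] * R[g][i] for g in range(len(R)))     # R^T (R u)
             for i in range(n)]
        norm = dot(u, u) ** 0.5
        u = [x / norm for x in u]
    return u

# Simulated data (hypothetical): 200 genes x 20 samples.
rng = random.Random(1)
n_samples, n_genes = 20, 200
group = [0] * 10 + [1] * 10               # effect the researcher protects
sex = [i % 2 for i in range(n_samples)]   # widespread effect left unspecified
Y = []
for g in range(n_genes):
    group_eff = 2.0 if g < 40 else 0.0    # genes 0-39 respond to treatment
    sex_eff = 1.5 if g >= 100 else 0.0    # genes 100-199 differ by sex
    Y.append([group_eff * group[i] + sex_eff * sex[i] + rng.gauss(0, 0.3)
              for i in range(n_samples)])

# 1. Residualize every gene on the protected model (group only).
R = [residualize_on_group(row, group) for row in Y]

# 2. Take the dominant residual pattern as a single surrogate variable.
sv = top_sample_pattern(R)

# 3. "Clean": subtract each gene's fitted surrogate-variable component.
#    Because sv sums to zero within each group, this matches fitting
#    gene ~ group + sv and removing only the sv term.
cleaned = []
for row in Y:
    beta = dot(row, sv) / dot(sv, sv)
    cleaned.append([y - beta * s for y, s in zip(row, sv)])

group_before = avg([mean_diff(Y[g], group) for g in range(40)])
group_after = avg([mean_diff(cleaned[g], group) for g in range(40)])
sex_before = avg([mean_diff(Y[g], sex) for g in range(100, 200)])
sex_after = avg([mean_diff(cleaned[g], sex) for g in range(100, 200)])

print("group effect before/after cleaning:", group_before, group_after)
print("sex effect before/after cleaning:  ", sex_before, sex_after)
```

In this toy setting the surrogate variable tracks sex, so the specified group effect is left intact while the sex differences largely vanish from the “cleaned” matrix. That is why exploring such data for effects outside the specified model is no longer reliable.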
