Capturing Heterogeneity in Gene Expression Studies by Surrogate Variable Analysis

It has been shown unambiguously that genetic, environmental, demographic, and technical factors can have substantial effects on gene expression levels. In addition to the measured variable(s) of interest, there will tend to be sources of signal due to factors that are unknown, unmeasured, or too complicated to capture through simple models. We show that failing to incorporate these sources of heterogeneity into an analysis can have widespread and detrimental effects on the study. Not only can this reduce power or induce unwanted dependence across genes, but it can also introduce spurious signal into many genes. This holds even for well-designed, randomized studies. We introduce “surrogate variable analysis” (SVA) to overcome the problems caused by heterogeneity in expression studies. SVA can be applied in conjunction with standard analysis techniques to accurately capture the relationship between expression and any modeled variables of interest. We apply SVA to disease class, time course, and genetics of gene expression studies, and show that it increases the biological accuracy and reproducibility of analyses in genome-wide expression studies.
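The idea of an unmodeled factor leaving a recoverable signal across many genes can be illustrated with a small simulation. The sketch below is not the SVA procedure itself, which the abstract does not specify. It assumes a single hidden factor, simulated data, and a simple stand-in technique: regress each gene on the modeled variable, then take the leading singular vector of the residual matrix as a crude surrogate. The function names `simulate` and `estimate_surrogate` and all parameter values are hypothetical.

```python
import numpy as np


def simulate(seed=0, n_genes=500, n_samples=40):
    """Simulate expression (genes x samples) with one modeled and one hidden factor."""
    rng = np.random.default_rng(seed)
    primary = np.repeat([0.0, 1.0], n_samples // 2)   # modeled variable, e.g. disease class
    hidden = rng.normal(size=n_samples)               # unmeasured factor (assumption)
    primary_effect = np.zeros(n_genes)
    primary_effect[:50] = 1.0                         # 50 genes respond to the modeled variable
    hidden_effect = np.zeros(n_genes)
    hidden_effect[25:275] = rng.normal(size=250)      # 250 genes respond to the hidden factor
    noise = rng.normal(scale=0.5, size=(n_genes, n_samples))
    expr = (np.outer(primary_effect, primary)
            + np.outer(hidden_effect, hidden)
            + noise)
    return expr, primary, hidden


def estimate_surrogate(expr, primary):
    """Return a sample-level pattern from the residuals of a per-gene linear fit.

    Simplified stand-in: the leading left singular vector of the
    (samples x genes) residual matrix after removing intercept + primary.
    """
    n_samples = expr.shape[1]
    design = np.column_stack([np.ones(n_samples), primary])
    # Least-squares fit for all genes at once: expr.T ~ design @ coef
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    resid = expr.T - design @ coef
    u, _, _ = np.linalg.svd(resid, full_matrices=False)
    return u[:, 0]


if __name__ == "__main__":
    expr, primary, hidden = simulate()
    surrogate = estimate_surrogate(expr, primary)
    # The sign of a singular vector is arbitrary, so report the absolute correlation.
    print(abs(np.corrcoef(surrogate, hidden)[0, 1]))
```

In this toy setting the estimated pattern could be added as a covariate to the per-gene model alongside the variable of interest, which is the sense in which a surrogate variable works "in conjunction with standard analysis techniques". Real data may contain several hidden factors, some of them correlated with the modeled variable, and this sketch does not handle those cases.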
