Genetic and Nongenetic Variation Revealed for the Principal Components of Human Gene Expression

Principal components analysis has been employed in gene expression studies to correct for population substructure and for batch and environmental effects. This method typically involves removing the variation contained in as many as 50 principal components (PCs), which can constitute a large proportion of the total variation present in the data. Each PC, however, can capture many sources of variation, including gene expression networks and genetic variation influencing transcript levels. We demonstrate that PCs generated from gene expression data can simultaneously contain both genetic and nongenetic factors. From heritability estimates we show that all PCs contained a considerable portion of genetic variation, whereas nongenetic artifacts such as batch effects were associated to varying degrees with the first 60 PCs. These PCs were enriched for biological pathways, including core immune function and metabolic pathways. The use of PC correction in two independent data sets reduced the number of cis- and trans-expression QTL detected. Comparisons of PC and linear model correction revealed that PC correction was less efficient at removing known batch effects and imposed a higher penalty on genetic variation. This study therefore highlights the danger of eliminating biologically relevant variation when applying PC correction to gene expression data.
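
The two correction strategies compared above can be illustrated with a minimal NumPy sketch. It is not the study's pipeline: the function names `remove_top_pcs` and `regress_out_batch`, the samples-by-genes layout, and the simulated data are all assumptions made for illustration. PC correction zeroes the leading singular values of the centered expression matrix, so it discards whatever those components hold, genetic or not. Linear model correction removes only the variation explained by a known batch label.

```python
import numpy as np


def remove_top_pcs(expr, k):
    """PC correction (illustrative): drop the variation in the first k PCs.

    expr is assumed to be a samples x genes matrix.
    """
    centered = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_kept = s.copy()
    s_kept[:k] = 0.0  # discard the k leading components
    return (u * s_kept) @ vt


def regress_out_batch(expr, batch):
    """Linear model correction (illustrative): residuals after regressing
    each gene on one-hot indicators of a known batch label."""
    centered = expr - expr.mean(axis=0)
    levels = np.unique(batch)
    design = (batch[:, None] == levels[None, :]).astype(float)
    coef, *_ = np.linalg.lstsq(design, centered, rcond=None)
    return centered - design @ coef


# Simulated example (hypothetical data): 40 samples, 15 genes, 4 batches.
rng = np.random.default_rng(0)
batch = np.repeat(np.arange(4), 10)
expr = rng.normal(size=(40, 15)) + batch[:, None] * 0.5

pc_corrected = remove_top_pcs(expr, k=3)
lm_corrected = regress_out_batch(expr, batch)

total = ((expr - expr.mean(axis=0)) ** 2).sum()
print("fraction of variation removed by PC correction:",
      1 - (pc_corrected ** 2).sum() / total)
print("fraction of variation removed by batch regression:",
      1 - (lm_corrected ** 2).sum() / total)
```

In this sketch the PC approach removes a fixed number of components regardless of what they represent, whereas the regression removes only what the batch labels explain; the trade-off between the two on real data is what the abstract reports.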
