variancePartition: interpreting drivers of variation in complex gene expression studies

Background: As large-scale studies of gene expression with multiple sources of biological and technical variation become widely adopted, characterizing these drivers of variation becomes essential to understanding disease biology and regulatory genetics.

Results: We describe a statistical and visualization framework, variancePartition, to prioritize drivers of variation based on a genome-wide summary, and identify genes that deviate from the genome-wide trend. Using a linear mixed model, variancePartition quantifies variation in each expression trait attributable to differences in disease status, sex, cell or tissue type, ancestry, genetic background, experimental stimulus, or technical variables. Analysis of four large-scale transcriptome profiling datasets illustrates that variancePartition recovers striking patterns of biological and technical variation that are reproducible across multiple datasets.

Conclusions: Our open source software, variancePartition, enables rapid interpretation of complex gene expression studies as well as other high-throughput genomics assays. variancePartition is available from Bioconductor: http://bioconductor.org/packages/variancePartition.
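The idea of attributing a gene's expression variance to named sources can be illustrated with a small sketch. This is not the variancePartition implementation: it assumes a balanced one-way design (one grouping variable, equal replicates per group) and uses a simple method-of-moments (ANOVA) estimator in place of a fitted linear mixed model. The function names, the "individual" and "residual" labels, and the example values are all hypothetical and chosen only for illustration.

```python
from statistics import mean


def variance_fractions(components):
    """Convert variance components into fractions of the total variance."""
    total = sum(components.values())
    if total <= 0:
        raise ValueError("total variance must be positive")
    return {name: value / total for name, value in components.items()}


def one_way_components(groups):
    """Method-of-moments variance components for a balanced one-way design.

    groups maps a group label (here, an individual) to a list of
    expression values for one gene; all lists must have equal length.
    This is an illustrative stand-in for a mixed-model fit, not the
    estimator used by any particular package.
    """
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("balanced design required")
    n = sizes.pop()          # replicates per group
    k = len(groups)          # number of groups
    if k < 2 or n < 2:
        raise ValueError("need at least 2 groups and 2 replicates per group")

    grand_mean = mean(x for values in groups.values() for x in values)
    group_means = {g: mean(values) for g, values in groups.items()}

    # Mean squares between and within groups.
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means.values()) / (k - 1)
    ms_within = sum(
        (x - group_means[g]) ** 2 for g, values in groups.items() for x in values
    ) / (k * (n - 1))

    # Between-group variance, truncated at zero if the estimate is negative.
    var_between = max((ms_between - ms_within) / n, 0.0)
    return {"individual": var_between, "residual": ms_within}


if __name__ == "__main__":
    # Hypothetical expression values for one gene: 3 individuals, 2 replicates each.
    gene = {"ind1": [5.0, 5.2], "ind2": [7.0, 7.2], "ind3": [9.0, 9.2]}
    for source, fraction in variance_fractions(one_way_components(gene)).items():
        print(f"{source}: {fraction:.3f}")
```

Repeating this calculation for every gene gives one fraction per source per gene. The genome-wide distribution of those fractions is the kind of summary the abstract describes for ranking drivers of variation and for spotting genes that depart from the overall trend.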
