GSVA: gene set variation analysis for microarray and RNA-Seq data

Background: Gene set enrichment (GSE) analysis is a popular framework for condensing information from gene expression profiles into a pathway or signature summary. The strengths of this approach over single-gene analysis include noise and dimension reduction, as well as greater biological interpretability. As molecular profiling experiments move beyond simple case-control studies, robust and flexible GSE methodologies are needed that can model pathway activity within highly heterogeneous data sets.

Results: To address this challenge, we introduce Gene Set Variation Analysis (GSVA), a GSE method that estimates variation of pathway activity over a sample population in an unsupervised manner. We demonstrate the robustness of GSVA in a comparison with current state-of-the-art sample-wise enrichment methods. Further, we provide examples of its utility in differential pathway activity and survival analysis. Lastly, we show how GSVA works analogously with data from both microarray and RNA-seq experiments.

Conclusions: GSVA provides increased power to detect subtle pathway activity changes over a sample population in comparison to corresponding methods. While GSE methods are generally regarded as end points of a bioinformatic analysis, GSVA constitutes a starting point to build pathway-centric models of biology. Moreover, GSVA addresses the current need for GSE methods for RNA-seq data. GSVA is an open source software package for R which forms part of the Bioconductor project and can be downloaded at http://www.bioconductor.org.
