Toward a gold standard for benchmarking gene set enrichment analysis

MOTIVATION: Although gene set enrichment analysis has become an integral part of high-throughput gene expression data analysis, the assessment of enrichment methods remains rudimentary and ad hoc. In the absence of suitable gold standards, evaluations are commonly restricted to selected datasets and biological reasoning on the relevance of resulting enriched gene sets.

RESULTS: We develop an extensible framework for reproducible benchmarking of enrichment methods based on defined criteria for applicability, gene set prioritization and detection of relevant processes. This framework incorporates a curated compendium of 75 expression datasets investigating 42 human diseases. The compendium features microarray and RNA-seq measurements, and each dataset is associated with a precompiled GO/KEGG relevance ranking for the corresponding disease under investigation. We perform a comprehensive assessment of 10 major enrichment methods, identifying significant differences in runtime and applicability to RNA-seq data, fraction of enriched gene sets depending on the null hypothesis tested and recovery of the predefined relevance rankings. We make practical recommendations on how methods originally developed for microarray data can efficiently be applied to RNA-seq data, how to interpret results depending on the type of gene set test conducted and which methods are best suited to effectively prioritize gene sets with high phenotype relevance.

AVAILABILITY: http://bioconductor.org/packages/GSEABenchmarkeR

CONTACT: ludwig.geistlinger@sph.cuny.edu
