Gene set enrichment for reproducible science: comparison of CERNO and eight other algorithms

Abstract

Motivation: Analysis of gene set (GS) enrichment is an essential part of functional omics studies. Here, we complement the established evaluation metrics of GS enrichment algorithms with a novel approach to assess the practical reproducibility of scientific results obtained from GS enrichment tests when applied to related data from different studies.

Results: We evaluated eight established algorithms and one novel algorithm for reproducibility, sensitivity, prioritization, false positive rate and computational time. The novel algorithm is Coincident Extreme Ranks in Numerical Observations (CERNO), a flexible and fast algorithm based on modified Fisher P-value integration. Using real-world datasets, we demonstrate that CERNO is robust to the choice of ranking metric, as well as to sample size and GS size. CERNO had the highest reproducibility while remaining sensitive, specific and fast. In the overall ranking, Pathway Analysis with Down-weighting of Overlapping Genes, CERNO and over-representation analysis performed best, while CERNO and GeneSetTest scored high in terms of reproducibility.

Availability and implementation: The tmod package implementing the CERNO algorithm is available from CRAN (cran.r-project.org/web/packages/tmod/index.html) and an online implementation can be found at http://tmod.online/. The datasets analyzed in this study are widely available in the KEGGdzPathwaysGEO and KEGGandMetacoreDzPathwaysGEO R packages and the GEO repository.

Supplementary information: Supplementary data are available at Bioinformatics online.
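The abstract describes CERNO only as "modified Fisher P-value integration" over a ranked gene list. The following is a minimal Python sketch of that idea, not the tmod implementation. It assumes the statistic takes the form F = -2 * sum(ln(r_i / N)), where r_i are the ranks of the gene set members in an ordered list of N genes, and that F is compared against a chi-square distribution with 2k degrees of freedom for a set with k ranked members. The function name `cerno_test` and its arguments are hypothetical and chosen for illustration.

```python
import math


def cerno_test(ranked_genes, gene_set):
    """Sketch of a CERNO-style test for one gene set.

    ranked_genes: gene identifiers ordered from most to least interesting
                  (for example by P-value); the ordering metric is the caller's choice.
    gene_set:     iterable of gene identifiers belonging to the set.
    Returns (statistic, p_value).
    Assumption: F = -2 * sum(ln(rank / N)) ~ chi-square with 2k degrees of freedom.
    """
    n = len(ranked_genes)
    members = set(gene_set)
    # 1-based ranks of the gene set members that appear in the ranked list
    ranks = [i for i, g in enumerate(ranked_genes, start=1) if g in members]
    k = len(ranks)
    if k == 0:
        return 0.0, 1.0  # nothing to test

    # Treat rank / N as a pseudo P-value and combine as in Fisher's method
    stat = -2.0 * sum(math.log(r / n) for r in ranks)

    # Chi-square survival function for even degrees of freedom (2k):
    # P(X > x) = exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!
    # Note: this closed form may lose precision for very large sets.
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    p_value = min(math.exp(-half) * total, 1.0)
    return stat, p_value


if __name__ == "__main__":
    genes = [f"g{i}" for i in range(1, 101)]
    print(cerno_test(genes, ["g1", "g2", "g3"]))      # members near the top
    print(cerno_test(genes, ["g98", "g99", "g100"]))  # members near the bottom
```

Because only ranks enter the statistic, any ordering of the genes can be supplied, which is one way to read the abstract's claim that the method is flexible with respect to the ranking metric.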
