Gene set analysis methods: statistical models and methodological differences

Many methods of gene set analysis developed in recent years have been compared empirically in a number of comprehensive review articles. Although it is recognized that different methods tend to identify different gene sets as significant, no consensus has been worked out as to which method is preferable, as the recommendations are often contradictory. In this article, we want to group and compare different methods in terms of the methodological assumptions pertaining to definition of a sample and formulation of the actual null hypothesis. We discuss four models of statistical experiment explicitly or implicitly assumed by most if not all currently available methods of gene set analysis. We analyse validity of the models in the context of the actual biological experiment. Based on this, we recommend a group of methods that provide biologically interpretable results in statistically sound way. Finally, we demonstrate how correlated or low signal-to-noise data affects performance of different methods, observed in terms of the false-positive rate and power.

[1]  William Stafford Noble,et al.  Exploring Gene Expression Data with Class Scores , 2001, Pacific Symposium on Biocomputing.

[2]  Eytan Domany,et al.  Outcome signature genes in breast cancer: is there a unique set? , 2004, Breast Cancer Research.

[3]  L. Ein-Dor,et al.  Thousands of samples are needed to generate a robust gene list for predicting outcome in cancer. , 2006, Proceedings of the National Academy of Sciences of the United States of America.

[4]  Jelle J. Goeman,et al.  Testing association of a pathway with survival using gene expression data , 2005, Bioinform..

[5]  U. Mansmann,et al.  Testing Differential Gene Expression in Functional Groups , 2005, Methods of Information in Medicine.

[6]  Seon-Young Kim,et al.  PAGE: Parametric Analysis of Gene Set Enrichment , 2005, BMC Bioinform..

[7]  Peter Bühlmann,et al.  Analyzing gene expression data in terms of gene sets: methodological issues , 2007, Bioinform..

[8]  Michael A. Black,et al.  Microarray-based gene set analysis: a comparison of current methods , 2008, BMC Bioinformatics.

[9]  Pablo Tamayo,et al.  Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[10]  John D. Potter,et al.  A Biological Evaluation of Six Gene Set Analysis Methods for Identification of Differentially Expressed Pathways in Microarray Data , 2008, Cancer informatics.

[11]  Korbinian Strimmer,et al.  BMC Bioinformatics BioMed Central Methodology article A general modular framework for gene set enrichment analysis , 2009 .

[12]  Zhen Jiang,et al.  Bioconductor Project Bioconductor Project Working Papers Year Paper Extensions to Gene Set Enrichment , 2013 .

[13]  Zhiping Weng,et al.  Gene set enrichment analysis: performance evaluation and usage guidelines , 2012, Briefings Bioinform..

[14]  Qi Liu,et al.  BMC Bioinformatics BioMed Central Methodology article Comparative evaluation of gene-set analysis methods , 2007 .

[15]  U. Mansmann Genomic profiling. Interplay between clinical epidemiology, bioinformatics and biostatistics. , 2005 .

[16]  Jelle J. Goeman,et al.  A global test for groups of genes: testing association with a clinical outcome , 2004, Bioinform..

[17]  P. Park,et al.  Discovering statistically significant pathways in expression profiling studies. , 2005, Proceedings of the National Academy of Sciences of the United States of America.

[18]  V. Arango,et al.  Using the Gene Ontology for Microarray Data Mining: A Comparison of Methods and Application to Age Effects in Human Prefrontal Cortex , 2004, Neurochemical Research.

[19]  M. Daly,et al.  PGC-1α-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes , 2003, Nature Genetics.

[20]  Qi Liu,et al.  Improving gene set analysis of microarray data by SAM-GS , 2007, BMC Bioinformatics.

[21]  Peter J. Park,et al.  A multivariate approach for integrating genome-wide expression data and biological knowledge , 2006, Bioinform..

[22]  Andrew B. Nobel,et al.  Significance analysis of functional categories in gene expression studies: a structured permutation approach , 2005, Bioinform..

[23]  Rafael A Irizarry,et al.  Gene set enrichment analysis made simple , 2009, Statistical methods in medical research.

[24]  R. Tibshirani,et al.  On testing the significance of sets of genes , 2006, math/0610667.

[25]  Seon-Young Kim,et al.  Gene-set approach for expression pattern analysis , 2008, Briefings Bioinform..

[26]  B. Fridley,et al.  Self-Contained Gene-Set Analysis of Expression Data: An Evaluation of Existing and Novel Methods , 2010, PloS one.

[27]  M. Newton,et al.  Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis , 2007, 0708.4350.

[28]  Michael C Wu,et al.  Prior biological knowledge-based approaches for the analysis of genome-wide expression profiles using gene sets and pathways , 2009, Statistical methods in medical research.