A general modular framework for gene set enrichment analysis

Background: Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear.

Results: We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general framework for detecting gene set enrichment. This framework provides a meta-theory of gene set analysis that not only helps to gain a better understanding of the relative merits of each embedded approach but also facilitates a principled comparison and offers insights into the relative interplay of the methods.

Conclusion: We use this framework to conduct a computer simulation comparing 261 different variants of gene set enrichment procedures and to analyze two experimental data sets. Based on the results we offer recommendations for best practices regarding the choice of effective procedures for gene set enrichment analysis.
