Concordant integrative gene set enrichment analysis of multiple large-scale two-sample expression data sets

Background: Gene set enrichment analysis (GSEA) is an important approach to the analysis of coordinated expression changes at the pathway level. Although many statistical and computational methods have been proposed for GSEA, the concordant integrative GSEA of multiple expression data sets has not been well addressed. Among related data sets collected for the same or similar study purposes, it is important to identify pathways or gene sets with concordant enrichment.

Methods: We categorize the underlying true states of differential expression into three representative categories: no change, positive change and negative change. Because of data noise, what we observe from experiments may not reflect the underlying truth. Although these categories are not observed in practice, they can be treated as latent states in a mixture model framework. We then define the mathematical concept of concordant gene set enrichment and calculate its probability based on a three-component multivariate normal mixture model. The related false discovery rate can be calculated and used to rank different gene sets.

Results: We used three published lung cancer microarray gene expression data sets to illustrate our proposed method. One analysis, based on the first two data sets, was conducted to compare our result with a previously published result based on a GSEA conducted separately for each individual data set. This comparison illustrates the advantage of our proposed concordant integrative gene set enrichment analysis. Then, with a relatively new and larger pathway collection, we used our method to conduct an integrative analysis of the first two data sets and of all three data sets. Both results showed that many gene sets could be identified with low false discovery rates, and the two results were consistent with each other. A further exploration based on the KEGG cancer pathway collection showed that a majority of these pathways could be identified by our proposed method.

Conclusions: This study illustrates that we can improve detection power and discovery consistency through a concordant integrative analysis of multiple large-scale two-sample gene expression data sets.
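
The ranking step described in the Methods can be illustrated with a minimal Python sketch. It assumes that a probability of concordant enrichment has already been obtained for each gene set (for example, from a fitted mixture model), and it estimates the false discovery rate of the top k gene sets as the average of one minus those probabilities. This averaging rule is a common posterior-probability-based estimate and is an assumption here, not necessarily the exact calculation used in the paper. The function name `rank_by_concordance` and the gene set names are hypothetical.

```python
def rank_by_concordance(probabilities):
    """Rank gene sets by probability of concordant enrichment.

    `probabilities` maps a gene set name to its (assumed, precomputed)
    probability of concordant enrichment. Returns a list of
    (name, probability, estimated_fdr) tuples, best-ranked first.
    """
    # Sort gene sets from the highest to the lowest probability.
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)

    results = []
    cumulative_error = 0.0
    for k, (name, prob) in enumerate(ranked, start=1):
        # 1 - prob is the estimated chance that this gene set is a false discovery.
        cumulative_error += 1.0 - prob
        # Estimated FDR when the top k gene sets are reported together.
        results.append((name, prob, cumulative_error / k))
    return results


# Hypothetical probabilities, for illustration only.
example = {"set_A": 0.99, "set_B": 0.50, "set_C": 0.90}
for name, prob, fdr in rank_by_concordance(example):
    print(f"{name}\tprobability={prob:.2f}\testimated FDR={fdr:.3f}")
```

Because the probabilities are sorted in decreasing order, the estimated false discovery rate is non-decreasing down the list, so reporting every gene set up to a chosen cutoff gives a well-defined selection.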
