Gene set bagging for estimating the probability a statistically significant result will replicate

Background: Significance analysis plays a major role in identifying and ranking genes, transcription factor binding sites, DNA methylation regions, and other high-throughput features associated with illness. We propose a new approach, called gene set bagging, for measuring the probability that a gene set replicates in future studies. Gene set bagging involves resampling the original high-throughput data, performing gene-set analysis on the resampled data, and confirming that biological categories replicate in the bagged samples.

Results: Using both simulated and publicly available genomics data, we demonstrate that significant categories in a gene set enrichment analysis may be unstable when subjected to resampling. We show that our method estimates the replication probability (R), the probability that a gene set will replicate as a significant result in future studies, and show in simulations that this method reflects replication better than each set's p-value.

Conclusions: Our results suggest that gene lists based on p-values are not necessarily stable, and therefore additional steps like gene set bagging may improve biological inference on gene sets.
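The resample, re-analyze, and count procedure described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the per-gene Welch t-statistic with a fixed cutoff, the hypergeometric enrichment test, the resampling of samples with replacement within each group, and all function and parameter names (`welch_t`, `enrichment_pvalues`, `gene_set_bagging`, `t_cut`, `B`, `alpha`) are hypothetical choices made for the example. The data are simulated.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t-statistic (0.0 if both groups have zero variance)."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return 0.0 if se == 0 else (mean(x) - mean(y)) / se


def hypergeom_sf(k, N, K, n):
    """P(X >= k) when drawing n items from N, of which K belong to the set."""
    upper = min(K, n)
    tail = sum(math.comb(K, i) * math.comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / math.comb(N, n)


def enrichment_pvalues(expr, labels, gene_sets, t_cut=2.0):
    """Assumed gene-set analysis: flag genes with |t| > t_cut, then test each
    set for over-representation among the flagged genes."""
    cases = [j for j, lab in enumerate(labels) if lab == 1]
    ctrls = [j for j, lab in enumerate(labels) if lab == 0]
    sig = {
        gene
        for gene, row in expr.items()
        if abs(welch_t([row[j] for j in cases], [row[j] for j in ctrls])) > t_cut
    }
    pvals = {}
    for name, members in gene_sets.items():
        members = set(members) & set(expr)  # ignore genes that were not measured
        pvals[name] = hypergeom_sf(len(sig & members), len(expr), len(members), len(sig))
    return pvals


def gene_set_bagging(expr, labels, gene_sets, B=50, alpha=0.05, seed=0):
    """Estimate R for each set as the fraction of B bootstrap resamples in
    which the set is significant at level alpha."""
    rng = random.Random(seed)
    cases = [j for j, lab in enumerate(labels) if lab == 1]
    ctrls = [j for j, lab in enumerate(labels) if lab == 0]
    hits = {name: 0 for name in gene_sets}
    for _ in range(B):
        # Assumption: resample samples with replacement within each group.
        idx = rng.choices(cases, k=len(cases)) + rng.choices(ctrls, k=len(ctrls))
        boot_expr = {gene: [row[j] for j in idx] for gene, row in expr.items()}
        boot_labels = [labels[j] for j in idx]
        for name, p in enrichment_pvalues(boot_expr, boot_labels, gene_sets).items():
            if p < alpha:
                hits[name] += 1
    return {name: count / B for name, count in hits.items()}


def simulate(n_genes=200, n_per_group=10, shift=2.0, seed=1):
    """Toy data: the first 20 genes are shifted upward in cases."""
    rng = random.Random(seed)
    labels = [1] * n_per_group + [0] * n_per_group
    expr = {}
    for i in range(n_genes):
        bump = shift if i < 20 else 0.0
        expr[f"g{i}"] = [rng.gauss(bump * lab, 1.0) for lab in labels]
    gene_sets = {
        "signal_set": {f"g{i}" for i in range(20)},
        "null_set": {f"g{i}" for i in range(100, 120)},
    }
    return expr, labels, gene_sets


expr, labels, gene_sets = simulate()
R = gene_set_bagging(expr, labels, gene_sets)
for name, r in R.items():
    print(f"{name}: estimated R = {r:.2f}")
```

In this toy setting the set containing the truly shifted genes should be called significant in far more resamples than the unrelated set, which is the kind of stability information a single p-value from the original data does not convey. Any gene-level statistic or enrichment test could be substituted for the ones assumed here.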
