Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups

Background: High-throughput technologies enable tens of thousands of measurements to be tested simultaneously. Identifying genes that are differentially expressed or associated with clinical outcomes therefore raises the multiple testing problem. False discovery rate (FDR) control is a statistical method that corrects for multiple comparisons when the test statistics are independent or weakly dependent. Although FDR control is frequently applied in microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. In this paper, we evaluate the accuracy of FDR estimation.

Methods: Using two real data sets, we resampled subgroups of patients and recalculated the statistics of interest to illustrate the imprecision of FDR estimation. Next, we generated many simulated data sets with block correlation structures and realistic noise parameters, using the Ultimate Microarray Prediction, Inference, and Reality Engine (UMPIRE) R package. We estimated FDR using a beta-uniform mixture (BUM) model and examined the variation in the resulting estimates.

Results: The three major sources of variation in FDR estimation are the sample size, the correlations among genes, and the true proportion of differentially expressed genes (DEGs). The sample size and the proportion of DEGs affect both the magnitude and the precision of FDR estimation, while the correlation structure mainly affects the variation of the estimated parameters.

Conclusions: We have decomposed the factors that affect FDR estimation and illustrated the direction and extent of their impact. We found that the proportion of DEGs has a significant impact on FDR; this factor might have been overlooked in previous studies and deserves more thought when controlling FDR.
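To make the BUM-based estimate concrete, the following is a minimal Python sketch of the standard beta-uniform mixture formulation, in which p-values have density f(p) = λ + (1 − λ) a p^(a−1) with 0 < a < 1, the upper bound on the null proportion is λ + (1 − λ) a, and the estimated FDR at a p-value cutoff τ is that bound times τ divided by the fitted CDF at τ. It is not the authors' code: the function names, the coarse grid-search fit, and the toy p-value simulator are all assumptions made for illustration, and the simulator draws independent p-values rather than the block-correlated data that UMPIRE generates.

```python
import math
import random


def bum_loglik(pvals, a, lam):
    """Log-likelihood of p-values under f(p) = lam + (1 - lam) * a * p**(a - 1)."""
    total = 0.0
    for p in pvals:
        p = max(p, 1e-12)  # guard against p == 0, where the density diverges
        total += math.log(lam + (1.0 - lam) * a * p ** (a - 1.0))
    return total


def fit_bum(pvals, steps=25):
    """Coarse grid-search maximum likelihood (an illustrative stand-in for a
    proper numerical optimizer). Returns (a_hat, lam_hat)."""
    best_ll, best_a, best_lam = -math.inf, None, None
    for i in range(1, steps):        # a in (0, 1)
        a = i / steps
        for j in range(steps):       # lam in [0, 1)
            lam = j / steps
            ll = bum_loglik(pvals, a, lam)
            if ll > best_ll:
                best_ll, best_a, best_lam = ll, a, lam
    return best_a, best_lam


def pi0_upper(a, lam):
    """Upper bound on the proportion of true null hypotheses."""
    return lam + (1.0 - lam) * a


def bum_fdr(tau, a, lam):
    """Estimated FDR when all p-values <= tau are called significant."""
    fitted_cdf = lam * tau + (1.0 - lam) * tau ** a
    return pi0_upper(a, lam) * tau / fitted_cdf


def simulate_pvalues(n, a, lam, rng):
    """Toy simulator (hypothetical): independent draws from the BUM itself.
    The Beta(a, 1) component is sampled by inverse CDF, U**(1/a)."""
    return [
        rng.random() if rng.random() < lam else rng.random() ** (1.0 / a)
        for _ in range(n)
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    pvals = simulate_pvalues(500, a=0.2, lam=0.8, rng=rng)
    a_hat, lam_hat = fit_bum(pvals)
    print("fitted a, lambda:", a_hat, lam_hat)
    print("estimated FDR at tau = 0.01:", bum_fdr(0.01, a_hat, lam_hat))
    # Same cutoff, different mixing weights: a larger uniform (null-like)
    # component gives a larger estimated FDR.
    for lam in (0.5, 0.7, 0.9):
        print("lambda =", lam, "FDR =", round(bum_fdr(0.01, 0.3, lam), 4))
```

The final loop holds the cutoff and the shape parameter fixed and changes only the mixing weight, which shows in the simplest possible setting why the proportion of DEGs can move the estimated FDR substantially. Refitting on repeated simulated or resampled p-value sets would give a rough picture of the spread in the fitted parameters.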
