Re-sampling strategy to improve the estimation of number of null hypotheses in FDR control under strong correlation structures

Background: When conducting multiple hypothesis tests, it is important to control the expected proportion of false positives among the rejected hypotheses, known as the False Discovery Rate (FDR). However, there is a tradeoff between controlling the FDR and maximizing power. Several methods, such as the q-value method, estimate the proportion of true null hypotheses among the tested hypotheses and use this estimate in controlling the FDR. These methods usually assume that the test statistics are independent or only weakly correlated. However, many types of data, such as microarray data, often contain large-scale correlation structures. Our objective was to develop methods that control the FDR while maintaining greater power in highly correlated datasets by improving the estimation of the proportion of null hypotheses.

Results: We showed that when strong correlation exists among the data, which is common in microarray datasets, the estimate of the proportion of null hypotheses can be highly variable, which in turn produces high variation in the FDR. We therefore developed a re-sampling strategy that reduces this variation by breaking the correlations between gene expression values, and then conservatively selects the upper quartile of the re-sampling estimates to obtain strong control of the FDR.

Conclusion: In simulation studies and in perturbations of actual microarray datasets, our method produced slightly biased estimates of the proportion of null hypotheses, but with lower mean squared errors than competing methods such as the q-value method. When genes are selected at the same nominal FDR level, our method achieves, on average, a significantly lower false discovery rate in exchange for a minor reduction in power.
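The aggregation step described in the Results can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the estimator, its tuning parameter, or how the re-sampling breaks the correlations. The sketch assumes a Storey-type estimator with a fixed `lam = 0.5`, takes the re-sampled p-value sets as given, and uses hypothetical function names (`storey_pi0`, `upper_quartile_pi0`, `adaptive_bh`). The last function shows one common way an estimate of the null proportion can be plugged into a Benjamini-Hochberg step-up procedure.

```python
import random
import statistics


def storey_pi0(pvalues, lam=0.5):
    """Storey-type estimate of the proportion of true nulls at a fixed lambda.

    Assumption: lam = 0.5 is an illustrative default, not a value from the paper.
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    # Truncate at 1 because a proportion cannot exceed 1.
    return min(1.0, above / (m * (1.0 - lam)))


def upper_quartile_pi0(pvalue_sets, lam=0.5):
    """Estimate pi0 on each re-sampled p-value set and return the upper quartile.

    How the p-value sets are produced (the re-sampling that breaks the
    correlations) is left to the caller; it is not specified here.
    """
    estimates = [storey_pi0(ps, lam) for ps in pvalue_sets]
    # With n=4, statistics.quantiles returns the three quartile cut points.
    _, _, q3 = statistics.quantiles(estimates, n=4, method="inclusive")
    return q3


def adaptive_bh(pvalues, pi0, alpha=0.05):
    """Benjamini-Hochberg step-up at level alpha / pi0; returns rejected indices."""
    if pi0 <= 0:
        raise ValueError("pi0 must be positive")
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / (m * pi0):
            cutoff = rank  # keep the largest rank that passes its threshold
    return sorted(order[:cutoff])


if __name__ == "__main__":
    rng = random.Random(0)

    def simulated_pvalues(m=1000, true_pi0=0.8):
        # Hypothetical data: nulls are uniform, non-nulls are pushed toward 0.
        n_null = int(m * true_pi0)
        nulls = [rng.random() for _ in range(n_null)]
        alts = [rng.random() ** 4 for _ in range(m - n_null)]
        return nulls + alts

    resampled_sets = [simulated_pvalues() for _ in range(50)]
    pi0_hat = upper_quartile_pi0(resampled_sets)
    rejected = adaptive_bh(resampled_sets[0], pi0_hat)
    print("upper-quartile pi0 estimate:", round(pi0_hat, 3))
    print("number of rejections:", len(rejected))
```

Taking the upper quartile rather than the mean or median of the re-sampling estimates deliberately biases the estimate upward, which makes the resulting rejection threshold more conservative. This matches the trade-off reported in the Conclusion: a slightly biased estimate in exchange for tighter FDR control.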
