A comparative review of estimates of the proportion unchanged genes and the false discovery rate

Background: In the analysis of microarray data one generally produces a vector of p-values that for each gene give the likelihood of obtaining equally strong evidence of change by pure chance. The distribution of these p-values is a mixture of two components corresponding to the changed genes and the unchanged ones. The focus of this article is how to estimate the proportion unchanged and the false discovery rate (FDR) and how to make inferences based on these concepts. Six published methods for estimating the proportion unchanged genes are reviewed, two alternatives are presented, and all are tested on both simulated and real data. All estimates but one make do without any parametric assumptions concerning the distributions of the p-values. Furthermore, the estimation and use of the FDR and the closely related q-value is illustrated with examples. Five published estimates of the FDR and one new are presented and tested. Implementations in R code are available.

Results: A simulation model based on the distribution of real microarray data plus two real data sets were used to assess the methods. The proposed alternative methods for estimating the proportion unchanged fared very well, and gave evidence of low bias and very low variance. Different methods perform well depending upon whether there are few or many regulated genes. Furthermore, the methods for estimating FDR showed a varying performance, and were sometimes misleading. The new method had a very low error.

Conclusion: The concept of the q-value or false discovery rate is useful in practical research, despite some theoretical and practical shortcomings. However, it seems possible to challenge the performance of the published methods, and there is likely scope for further developing the estimates of the FDR. The new methods provide the scientist with more options to choose a suitable method for any particular experiment.
The article advocates the use of the conjoint information regarding false positive and negative rates as well as the proportion unchanged when identifying changed genes.
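
To make the two quantities discussed above concrete, the following is a minimal Python sketch, not the article's R implementation and not one of its new methods. It assumes a simple threshold-based estimate of the proportion unchanged (the share of p-values above a cutoff `lam`, rescaled by `1 / (1 - lam)`, in the style usually attributed to Storey) and the standard step-up q-value computation scaled by that estimate. The function names, the default `lam=0.5`, and the simulated mixture are illustrative assumptions.

```python
import random


def estimate_pi0(pvalues, lam=0.5):
    """Threshold-based estimate of the proportion of unchanged genes.

    Assumption: p-values above `lam` come almost only from unchanged
    genes, whose p-values are uniform on [0, 1].
    """
    m = len(pvalues)
    above = sum(1 for p in pvalues if p > lam)
    # Rescale the observed share above lam and cap the estimate at 1.
    return min(1.0, above / ((1.0 - lam) * m))


def q_values(pvalues, pi0=1.0):
    """Step-up q-values: pi0 * m * p / rank, made monotone from the top."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # estimated FDR over all rejection thresholds at or above it.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        q[i] = running_min
    return q


if __name__ == "__main__":
    # Hypothetical mixture: 800 unchanged genes (uniform p-values) and
    # 200 changed genes (p-values concentrated near zero).
    rng = random.Random(0)
    pvals = [rng.random() for _ in range(800)]
    pvals += [rng.betavariate(0.2, 5.0) for _ in range(200)]

    pi0_hat = estimate_pi0(pvals)
    qvals = q_values(pvals, pi0=pi0_hat)
    n_called = sum(1 for q in qvals if q <= 0.05)
    print(f"estimated proportion unchanged: {pi0_hat:.3f}")
    print(f"genes with q-value <= 0.05: {n_called}")
```

Setting `pi0=1.0` in `q_values` gives the conservative version that ignores the proportion unchanged; plugging in an estimate below 1 lowers every q-value proportionally, which is why the quality of that estimate matters for the resulting gene list.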
