How accurately can we control the FDR in analyzing microarray data?

SUMMARY We evaluate the performance of two FDR-based multiple testing procedures, those of Benjamini and Hochberg (1995, J. R. Stat. Soc. Ser. B, 57, 289-300) and Storey (2002, J. R. Stat. Soc. Ser. B, 64, 479-498), in the analysis of real microarray data. These procedures commonly require that the test statistics be independent or weakly dependent. However, the expression levels of different genes on the same array are usually correlated, both because genes are co-expressed and because experiment-specific and subject-specific sources of error are not adjusted for in the data analysis. Because microarray data are high-dimensional, it is usually impossible to check whether a given dataset meets the weak dependence condition. We propose to generate a large number of test statistics from a simulation model that has asymptotically (as the number of arrays grows) the same correlation structure as the test statistics calculated from the given data, and to investigate how accurately the FDR-based procedures control the FDR on the simulated data. Our approach thus checks the performance of these procedures directly for a given dataset, rather than checking the weak dependence condition. We illustrate the proposed method with two real microarray datasets, one in which the clinical endpoint is disease group and another in which it is survival.
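The idea of checking FDR control by simulation can be illustrated with a minimal Python sketch. It applies the Benjamini and Hochberg step-up procedure to simulated test statistics and averages the false discovery proportion over many replicates. The correlation model here is an assumption made purely for illustration (equicorrelated standard normal statistics with a hypothetical correlation `rho`); it is not the simulation model proposed in the paper, which matches the correlation structure of the given dataset. The function names and all parameter values are likewise illustrative.

```python
import math
import random


def bh_reject(pvalues, q):
    """Benjamini-Hochberg step-up: return the set of rejected indices."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose ordered p-value is at or below q * rank / m
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= q * rank / m:
            k = rank
    return set(order[:k])


def two_sided_p(z):
    """Two-sided p-value of a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def empirical_fdr(m=200, m1=20, effect=3.0, rho=0.3, q=0.05,
                  n_sim=500, seed=1):
    """Average false discovery proportion of BH over simulated datasets.

    Assumed model (illustrative only): z_j = mean_j + sqrt(rho) * u
    + sqrt(1 - rho) * e_j, so every pair of statistics has correlation
    rho and each statistic has unit variance. The first m1 hypotheses
    are the true alternatives, with mean `effect`.
    """
    rng = random.Random(seed)
    total_fdp = 0.0
    for _ in range(n_sim):
        shared = rng.gauss(0.0, 1.0)  # factor shared by all genes
        pvals = []
        for j in range(m):
            mean = effect if j < m1 else 0.0
            noise = rng.gauss(0.0, 1.0)
            z = mean + math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * noise
            pvals.append(two_sided_p(z))
        rejected = bh_reject(pvals, q)
        false_hits = sum(1 for j in rejected if j >= m1)
        # The false discovery proportion is 0 when nothing is rejected.
        total_fdp += false_hits / max(len(rejected), 1)
    return total_fdp / n_sim


if __name__ == "__main__":
    for rho in (0.0, 0.3, 0.6):
        print(rho, round(empirical_fdr(rho=rho), 4))
```

Comparing the printed averages with the nominal level `q` gives a rough picture of how well the procedure holds its level under the assumed dependence; a comparable check for Storey's procedure would substitute its rejection rule for `bh_reject`.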

[1] R. Prentice, et al. Commentary on Andersen and Gill's "Cox's Regression Model for Counting Processes: A Large Sample Study", 1982.

[2] W. Press, et al. Numerical Recipes: The Art of Scientific Computing, 1987.

[3] Jean YH Yang, et al. Bioconductor: open software development for computational biology and bioinformatics, 2004, Genome Biology.

[4] Y. Benjamini, et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing, 1995.

[5] S. S. Young, et al. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment, 1993.

[6] R. Tibshirani, et al. Significance analysis of microarrays applied to the ionizing radiation response, 2001, Proceedings of the National Academy of Sciences of the United States of America.

[7] R. Gill, et al. Cox's regression model for counting processes: a large sample study (preprint), 1982.

[8] Yifan Huang, et al. To permute or not to permute, 2006, Bioinform.

[9] L. J. Wei, et al. The Robust Inference for the Cox Proportional Hazards Model, 1989.

[10] Y. Benjamini, et al. The control of the false discovery rate in multiple testing under dependency, 2001.

[11] Kouros Owzar, et al. A multiple testing procedure to associate gene expression levels with survival, 2005, Statistics in medicine.

[12] David E. Misek, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma, 2002, Nature Medicine.

[13] D. Cox. Regression Models and Life-Tables, 1972.

[14] Sin-Ho Jung, et al. Sample size for FDR-control in microarray data analysis, 2005, Bioinform.

[15] Joseph P. Romano, et al. Generalizations of the familywise error rate, 2005, math/0507420.

[16] John D. Storey. The positive false discovery rate: a Bayesian interpretation and the q-value, 2003.

[17] J. Mesirov, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, 1999, Science.

[18] M. J. van der Laan, et al. Augmentation Procedures for Control of the Generalized Family-Wise Error Rate and Tail Probabilities for the Proportion of False Positives, 2004, Statistical applications in genetics and molecular biology.

[19] John D. Storey. A direct approach to false discovery rates, 2002.

[20] John D. Storey, et al. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach, 2004.

[21] D. Y. Lin, et al. An efficient Monte Carlo approach to assessing statistical significance in genomic studies, 2005, Bioinform.