A general framework for multiple testing dependence

We develop a general framework for performing large-scale significance testing in the presence of arbitrarily strong dependence. We derive a low-dimensional set of random vectors, called a dependence kernel, that fully captures the dependence structure in an observed high-dimensional dataset. This result shows a surprising reversal of the “curse of dimensionality” in the high-dimensional hypothesis testing setting. We show theoretically that conditioning on a dependence kernel is sufficient to render statistical tests independent regardless of the level of dependence in the observed data. This framework for multiple testing dependence has implications in a variety of common multiple testing problems, such as in gene expression studies, brain imaging, and spatial epidemiology.
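
As a rough illustration of the central claim, the toy simulation below generates many variables that share a single latent factor, then compares the correlation among them before and after that factor is regressed out. This is a minimal sketch under strong assumptions, not the authors' procedure: the one-dimensional factor `g` stands in for a dependence kernel and is treated as known, whereas the abstract says nothing about how the kernel is obtained in practice. The names `g`, `gamma`, `residualize`, and `mean_abs_offdiag_corr` are hypothetical and chosen only for this example.

```python
import numpy as np


def mean_abs_offdiag_corr(rows):
    """Average absolute correlation between distinct rows of a 2-D array."""
    corr = np.corrcoef(rows)
    off_diagonal = ~np.eye(corr.shape[0], dtype=bool)
    return float(np.abs(corr[off_diagonal]).mean())


def residualize(data, kernel):
    """Remove the least-squares fit on `kernel` (plus intercept) from each row.

    data   : (m, n) array, one row per tested variable, one column per sample
    kernel : (n, r) array, assumed-known low-dimensional factors (hypothetical)
    """
    design = np.column_stack([np.ones(kernel.shape[0]), kernel])
    coef, *_ = np.linalg.lstsq(design, data.T, rcond=None)
    return data - (design @ coef).T


rng = np.random.default_rng(0)
m, n = 200, 50  # m tested variables, n samples

# Assumed data model: one shared latent factor plus independent noise.
g = rng.standard_normal(n)
gamma = rng.uniform(1.0, 3.0, size=m) * rng.choice([-1.0, 1.0], size=m)
data = np.outer(gamma, g) + rng.standard_normal((m, n))

before = mean_abs_offdiag_corr(data)
after = mean_abs_offdiag_corr(residualize(data, g[:, None]))

print(f"mean |corr| before conditioning: {before:.3f}")
print(f"mean |corr| after conditioning:  {after:.3f}")
```

In this toy setting the residual correlation drops to roughly the level expected from sampling noise alone, which is the behaviour the abstract describes for tests conditioned on a dependence kernel. When the kernel must be estimated from the data, the picture is more involved and this sketch does not address it.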
