Correlated z-Values and the Accuracy of Large-Scale Statistical Estimates

We consider large-scale studies in which there are hundreds or thousands of correlated cases to investigate, each represented by its own normal variate, typically a z-value. A familiar example is a microarray experiment comparing the expression levels of thousands of genes between healthy and sick subjects. This paper concerns the accuracy of summary statistics for the collection of normal variates, such as their empirical cdf or a false discovery rate statistic. It might seem that assessing this accuracy requires estimating an N by N correlation matrix, where N is the number of cases, but our main result shows that this is not necessary: good approximations to the accuracy can be based on the root mean square correlation over all N ⋅ (N − 1)/2 pairs, a quantity that is often easily estimated. A second result shows that z-values closely follow normal distributions even under nonnull conditions, supporting application of the main theorem. Practical application of the theory is illustrated for a large leukemia microarray study.
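As an illustration of the quantity the main result relies on, the following minimal sketch computes the root mean square of the sample correlations over all N ⋅ (N − 1)/2 pairs of cases. It is not the paper's estimator: the function name `rms_correlation`, the layout of the data matrix (one row per case, one column per sample), and the simulated data are all assumptions made for this example, and no bias correction is applied. With few samples per case the raw value can be expected to overstate the true root mean square correlation, because sampling noise in each pairwise correlation adds to the mean square.

```python
import numpy as np


def rms_correlation(x):
    """Root mean square of the sample correlations over all distinct pairs of rows.

    x is assumed to have shape (N, n): N cases (rows), n samples (columns).
    This is the raw, uncorrected value; it is an illustrative helper only.
    """
    corr = np.corrcoef(x)                    # N x N sample correlation matrix (rows = cases)
    n_cases = corr.shape[0]
    upper = np.triu_indices(n_cases, k=1)    # indices of the N*(N-1)/2 pairs above the diagonal
    return float(np.sqrt(np.mean(corr[upper] ** 2)))


# Simulated data (hypothetical): every case shares one common factor,
# so each pair of cases has true correlation 0.25 / 1.25 = 0.2.
rng = np.random.default_rng(0)
n_cases, n_samples = 50, 200
shared = rng.standard_normal(n_samples)
x = 0.5 * shared + rng.standard_normal((n_cases, n_samples))

alpha_hat = rms_correlation(x)
print(f"raw RMS correlation over {n_cases * (n_cases - 1) // 2} pairs: {alpha_hat:.3f}")
```

The sketch forms the full N by N sample correlation matrix for clarity, which is convenient for moderate N but is not required in principle, since only a single summary of the pairwise correlations is needed.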
