Marginal asymptotics for the “large $p$, small $n$” paradigm: With applications to microarray data

The "large p, small n" paradigm arises in microarray studies, image analysis, high-throughput molecular screening, astronomy, and many other high-dimensional applications. False discovery rate (FDR) methods are useful for resolving the accompanying multiple testing problems. In cDNA microarray studies, for example, p-values may be computed for each of p genes using data from n arrays, where typically p is in the thousands and n is less than 30. For FDR methods to be valid in identifying differentially expressed genes, the p-values for the nondifferentially expressed genes must simultaneously have uniform marginal distributions. While this is feasible for permutation p-values, it is problematic for asymptotic-based p-values, since the number of p-values involved goes to infinity and intuition suggests that at least some of them should behave erratically. We examine this neglected issue when n is moderately large but p is almost exponentially large relative to n. We show the somewhat surprising result that, under very general dependence structures and for both mean and median tests, the p-values are simultaneously valid. A small simulation study and a data analysis are used for illustration.
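The setting described above can be made concrete with a minimal sketch. It is not the paper's simulation study. It assumes independent standard normal expression values under the null, a one-sample mean test with a normal approximation to the t-statistic, and the Benjamini-Hochberg step-up procedure as the FDR method. All function names and default values (`simulate_null_p_values`, `p=2000`, `n=20`) are hypothetical choices made for illustration.

```python
import math
import random


def asymptotic_p_value(sample):
    """Two-sided p-value for H0: mean = 0, using the normal
    approximation to the t-statistic (an asymptotic p-value)."""
    n = len(sample)
    mean = sum(sample) / n
    var = sum((x - mean) ** 2 for x in sample) / (n - 1)
    t = mean / math.sqrt(var / n)
    # 2 * (1 - Phi(|t|)) equals erfc(|t| / sqrt(2)).
    return math.erfc(abs(t) / math.sqrt(2))


def simulate_null_p_values(p=2000, n=20, seed=0):
    """Asymptotic p-values for p null genes measured on n arrays.
    Assumption: independent N(0, 1) data, chosen only for simplicity."""
    rng = random.Random(seed)
    return [
        asymptotic_p_value([rng.gauss(0.0, 1.0) for _ in range(n)])
        for _ in range(p)
    ]


def max_deviation_from_uniform(p_values):
    """Kolmogorov-type distance between the empirical distribution
    of the p-values and the uniform distribution on [0, 1]."""
    m = len(p_values)
    ordered = sorted(p_values)
    return max(
        max((i + 1) / m - v, v - i / m) for i, v in enumerate(ordered)
    )


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure
    at nominal FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= q * rank / m:
            k = rank  # keep the largest rank that passes its threshold
    return sorted(order[:k])
```

Calling `max_deviation_from_uniform(simulate_null_p_values())` gives a rough sense of how far the null p-values are from uniform for a given p and n, and `benjamini_hochberg` shows how many null genes would be falsely flagged. With n this small the normal approximation is known to be anticonservative, which is one reason the abstract restricts attention to moderately large n.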