A two-sample test for high-dimensional data with applications to gene-set testing

We propose a two-sample test for the means of high-dimensional data when the data dimension is much larger than the sample size. The classical Hotelling's $T^2$ test is not applicable in this ``large $p$, small $n$'' situation, because the pooled sample covariance matrix is singular once the dimension exceeds the pooled degrees of freedom. The proposed test does not require explicit conditions on the relationship between the data dimension and the sample size, which offers considerable flexibility in analyzing high-dimensional data. One application of the proposed test is testing the significance of sets of genes, which we demonstrate in an empirical study on a leukemia data set.
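
To illustrate why Hotelling's $T^2$ is unavailable in this regime, the following minimal sketch checks numerically that the pooled sample covariance matrix is rank-deficient when the dimension exceeds $n_1 + n_2 - 2$. The simulated Gaussian data, the sample sizes, and the helper name `pooled_covariance` are assumptions made for illustration only; this is not the proposed test and is not taken from the paper.

```python
import numpy as np


def pooled_covariance(x, y):
    """Pooled sample covariance of two samples (rows are observations)."""
    n1, n2 = x.shape[0], y.shape[0]
    s1 = np.cov(x, rowvar=False)  # p x p sample covariance of the first sample
    s2 = np.cov(y, rowvar=False)  # p x p sample covariance of the second sample
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)


# Illustrative sizes (assumed): dimension p far above the sample sizes.
rng = np.random.default_rng(0)
n1, n2, p = 10, 12, 50
x = rng.normal(size=(n1, p))
y = rng.normal(size=(n2, p))

s_pooled = pooled_covariance(x, y)
rank = np.linalg.matrix_rank(s_pooled)

# The rank cannot exceed n1 + n2 - 2, so the p x p matrix is singular
# and the inverse required by Hotelling's T^2 does not exist.
print(f"p = {p}, rank of pooled covariance = {rank}")
```

Since the rank is bounded by $n_1 + n_2 - 2$ regardless of the data, the singularity is structural rather than a feature of this particular simulation.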
