A More Powerful Two-Sample Test in High Dimensions using Random Projection

We consider the hypothesis testing problem of detecting a shift between the means of two multivariate normal distributions in the high-dimensional setting, allowing for the data dimension p to exceed the sample size n. Our contribution is a new test statistic for the two-sample test of means that integrates a random projection with the classical Hotelling T² statistic. Working within a high-dimensional framework that allows (p, n) → ∞, we first derive an asymptotic power function for our test, and then provide sufficient conditions for it to achieve greater power than other state-of-the-art tests. Using ROC curves generated from simulated data, we demonstrate superior performance against competing tests in the parameter regimes anticipated by our theoretical results. Lastly, we illustrate an advantage of our procedure with comparisons on a high-dimensional gene expression dataset involving the discrimination of different types of cancer.
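The idea of combining a random projection with Hotelling's T² can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rp_hotelling_t2`, the Gaussian projection matrix, and the default projected dimension `k` (half the pooled degrees of freedom) are choices made here for the example. The sketch relies on the classical fact that, for normal data with a common covariance and conditionally on the projection, a scaled T² in k dimensions follows an F distribution with (k, n1 + n2 - k - 1) degrees of freedom under the null hypothesis.

```python
import numpy as np


def rp_hotelling_t2(X, Y, k=None, rng=None):
    """Hotelling T^2 computed on randomly projected data (illustrative sketch).

    X: (n1, p) array, Y: (n2, p) array. Returns (t2, f_stat, (df1, df2)).
    """
    rng = np.random.default_rng(rng)
    n1, p = X.shape
    n2, p_y = Y.shape
    if p != p_y:
        raise ValueError("X and Y must have the same number of columns")

    dof = n1 + n2 - 2  # pooled degrees of freedom
    if k is None:
        # Assumption for this sketch: project to about half the pooled dof.
        k = max(1, min(p, dof // 2))
    if not 1 <= k <= min(p, dof):
        raise ValueError("k must satisfy 1 <= k <= min(p, n1 + n2 - 2)")

    # Random projection from p dimensions down to k (Gaussian entries assumed).
    P = rng.standard_normal((p, k))
    Xk, Yk = X @ P, Y @ P

    # Difference of projected sample means and pooled covariance (k x k).
    diff = Xk.mean(axis=0) - Yk.mean(axis=0)
    Xc = Xk - Xk.mean(axis=0)
    Yc = Yk - Yk.mean(axis=0)
    S = (Xc.T @ Xc + Yc.T @ Yc) / dof

    # Classical two-sample Hotelling T^2 in the projected space.
    t2 = (n1 * n2 / (n1 + n2)) * float(diff @ np.linalg.solve(S, diff))

    # Scaling under which T^2 is F(k, dof - k + 1) distributed under the null.
    df2 = dof - k + 1
    f_stat = t2 * df2 / (dof * k)
    return t2, f_stat, (k, df2)


if __name__ == "__main__":
    gen = np.random.default_rng(0)
    X = gen.standard_normal((20, 100))        # p = 100 exceeds n1 = n2 = 20
    Y = gen.standard_normal((20, 100)) + 0.5  # mean shift in every coordinate
    t2, f_stat, dfs = rp_hotelling_t2(X, Y, rng=1)
    print(f"T^2 = {t2:.2f}, F = {f_stat:.2f}, df = {dfs}")
```

A p-value can then be read off the upper tail of the F distribution with the returned degrees of freedom. Because the projected covariance is only k × k, it stays invertible even when p exceeds the sample size, which is the situation where the unprojected T² statistic is undefined.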
