RAPTT: An Exact Two-Sample Test in High Dimensions Using Random Projections

In high dimensions, the classical Hotelling's T² test tends to have low power, or becomes undefined because the sample covariance matrix is singular. In this article, this problem is overcome by projecting the data matrix onto lower-dimensional subspaces through multiplication by random matrices. We propose RAPTT (RAndom Projection T²-Test), an exact test for equality of means of two normal populations based on the projected lower-dimensional data. RAPTT places no constraints on the dimension of the data or the sample size. A simulation study indicates that in high dimensions the power of this test is often greater than that of competing tests. The advantages of RAPTT are illustrated on a high-dimensional gene expression dataset involving the discrimination of tumor and normal colon tissues.
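
To make the projection idea concrete, the following is a minimal sketch of a two-sample Hotelling's T² test applied after a single random projection. It is an illustration under stated assumptions, not the authors' full procedure: the function name `projected_hotelling_test`, the choice of a Gaussian projection matrix, and the use of only one projection are all assumptions made here, since the abstract does not specify how the random matrices are drawn or how several projections are combined. Because the projection matrix is drawn independently of the data, the projected samples remain normal with a common covariance, so the usual F reference distribution applies conditionally on the projection.

```python
import numpy as np
from scipy import stats


def projected_hotelling_test(x, y, k, rng):
    """Two-sample Hotelling's T^2 test on one random k-dimensional projection.

    x has shape (n1, p), y has shape (n2, p). Requires k <= n1 + n2 - 2 so
    that the pooled covariance of the projected data is invertible.
    Hypothetical helper for illustration only.
    """
    n1, p = x.shape
    n2 = y.shape[0]
    df2 = n1 + n2 - k - 1
    if df2 < 1:
        raise ValueError("k must be at most n1 + n2 - 2")

    # Projection matrix drawn independently of the data
    # (Gaussian entries are an assumption of this sketch).
    r = rng.standard_normal((p, k))
    xp, yp = x @ r, y @ r

    # Difference of projected sample means.
    diff = xp.mean(axis=0) - yp.mean(axis=0)

    # Pooled sample covariance of the projected data (k x k).
    xc = xp - xp.mean(axis=0)
    yc = yp - yp.mean(axis=0)
    pooled = (xc.T @ xc + yc.T @ yc) / (n1 + n2 - 2)

    # Hotelling's T^2 and its F transformation.
    t2 = (n1 * n2) / (n1 + n2) * (diff @ np.linalg.solve(pooled, diff))
    f_stat = df2 / ((n1 + n2 - 2) * k) * t2
    p_value = stats.f.sf(f_stat, k, df2)
    return t2, f_stat, p_value


# Usage: p = 200 variables, 20 observations per group, projected to k = 5.
rng = np.random.default_rng(0)
x = rng.standard_normal((20, 200))
y = rng.standard_normal((20, 200)) + 2.0  # shifted mean in every coordinate
t2, f_stat, p_value = projected_hotelling_test(x, y, k=5, rng=rng)
print(f"T2 = {t2:.2f}, F = {f_stat:.2f}, p-value = {p_value:.3g}")
```

Here the original dimension (200) exceeds the total sample size (40), so the classical T² statistic is undefined, while the projected statistic is well defined for any k up to n1 + n2 - 2. The result of a single projection depends on the random draw, which is one reason a practical procedure would likely use more than one projection.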
