A high-dimensional two-sample test for the mean using random subspaces

A common problem in genetics is testing whether the expression of a set of highly dependent genes differs between two populations, typically in a high-dimensional setting where the data dimension exceeds the sample size. Most high-dimensional tests for the equality of two mean vectors rely on naive diagonal or trace estimators of the covariance matrix and so ignore dependence between variables. A test using random subspaces is proposed, which offers higher power when the variables are dependent and is invariant under linear transformations of the marginal distributions. The p-values for the test are obtained by permutation, so the test does not rely on assumptions about normality or the structure of the covariance matrix. Simulations show that the new test has higher power than competing tests in realistic settings motivated by microarray gene expression data. Computational aspects of high-dimensional permutation tests are also discussed, and an efficient R implementation of the proposed test is provided.
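The abstract does not state the test statistic, so the following is only a minimal sketch of one plausible reading: a Hotelling-type statistic is computed on several randomly chosen coordinate subsets, the values are averaged, and the p-value is obtained by permuting group labels. All names (`solve`, `hotelling_t2`, `random_subspace_test`), the choice of averaging, the reuse of the same subspaces across permutations, and the default parameter values are assumptions made for illustration; they are not taken from the paper or its R implementation. The subspace size `k` is assumed to be smaller than the pooled degrees of freedom so that the covariance estimate is invertible.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for j in range(c, n + 1):
                m[r][j] -= f * m[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def hotelling_t2(x, y, cols):
    """Two-sample Hotelling T^2 restricted to the coordinates listed in cols."""
    n1, n2, k = len(x), len(y), len(cols)
    mx = [sum(row[c] for row in x) / n1 for c in cols]
    my = [sum(row[c] for row in y) / n2 for c in cols]
    # Pooled sample covariance matrix of the selected coordinates.
    s = [[0.0] * k for _ in range(k)]
    for rows, mean in ((x, mx), (y, my)):
        for row in rows:
            d = [row[c] - mu for c, mu in zip(cols, mean)]
            for i in range(k):
                for j in range(k):
                    s[i][j] += d[i] * d[j]
    dof = n1 + n2 - 2
    s = [[v / dof for v in row] for row in s]
    diff = [a - b for a, b in zip(mx, my)]
    w = solve(s, diff)  # S^{-1} (mean_x - mean_y)
    return n1 * n2 / (n1 + n2) * sum(a * b for a, b in zip(diff, w))


def subspace_statistic(x, y, subspaces):
    """Average the restricted T^2 statistic over all subspaces."""
    return sum(hotelling_t2(x, y, cols) for cols in subspaces) / len(subspaces)


def random_subspace_test(x, y, k=2, n_subspaces=20, n_perm=199, seed=0):
    """Return (observed statistic, permutation p-value). Hypothetical helper."""
    rng = random.Random(seed)
    p = len(x[0])
    # Assumption: the same random coordinate subsets are reused for every permutation.
    subspaces = [rng.sample(range(p), k) for _ in range(n_subspaces)]
    observed = subspace_statistic(x, y, subspaces)
    pooled, n1 = x + y, len(x)
    exceed = 0
    for _ in range(n_perm):
        perm = pooled[:]
        rng.shuffle(perm)  # random reassignment of group labels
        if subspace_statistic(perm[:n1], perm[n1:], subspaces) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (exceed + 1) / (n_perm + 1)
```

Here `x` and `y` are lists of observations, each a list of `p` numeric values. Because every subspace has only `k` coordinates, the covariance estimate stays invertible even when `p` is larger than the total sample size, which is the situation the abstract describes.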
