Two-sample testing in high dimensions

Summary. We propose new methodology for two-sample testing in high dimensional models. The methodology provides a high dimensional analogue to the classical likelihood ratio test and is applicable to essentially any model class where sparse estimation is feasible. Sparse structure is used in the construction of the test statistic. In the general case, testing then involves non-nested model comparison, and we provide asymptotic results for the high dimensional setting. We put forward computationally efficient procedures based on data splitting, including a variant of the permutation test that exploits sparse structure. We illustrate the general approach in two-sample comparisons of high dimensional regression models (‘differential regression’) and graphical models (‘differential network’), showing results on simulated data as well as data from two recent cancer studies.
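To make the data-splitting idea concrete, the following is a minimal sketch for the differential regression case. It illustrates the general recipe only and is not the authors' procedure. Several choices are assumptions made for brevity: sparse estimation is replaced by simple marginal-correlation screening rather than the lasso, the models are Gaussian linear regressions without intercept, and calibration uses plain label permutation on the held-out half with the selected supports kept fixed. All function names (`screen`, `gauss_loglik`, `lr_stat`, `split_perm_test`) are hypothetical.

```python
import numpy as np

def screen(X, y, k):
    """Select the k predictors most correlated with y.
    Assumption: a crude stand-in for a sparse estimator such as the lasso."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    score = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) + 1e-12)
    return np.sort(np.argsort(score)[-k:])

def gauss_loglik(X, y, active):
    """Maximised Gaussian log-likelihood of a least-squares fit on `active`."""
    n = len(y)
    beta, *_ = np.linalg.lstsq(X[:, active], y, rcond=None)
    rss = np.sum((y - X[:, active] @ beta) ** 2)
    sigma2 = max(rss / n, 1e-12)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1.0)

def lr_stat(X, y, g, act1, act2, act_pooled):
    """Twice the log-likelihood ratio: separate group models vs. one pooled model.
    The supports may differ, so the comparison is non-nested in general."""
    ll_sep = (gauss_loglik(X[g == 0], y[g == 0], act1)
              + gauss_loglik(X[g == 1], y[g == 1], act2))
    ll_pool = gauss_loglik(X, y, act_pooled)
    return 2.0 * (ll_sep - ll_pool)

def split_perm_test(X1, y1, X2, y2, k=3, n_perm=200, seed=0):
    """Screen on one half of each sample, test on the other half."""
    rng = np.random.default_rng(seed)
    i1, i2 = rng.permutation(len(y1)), rng.permutation(len(y2))
    a1, b1 = i1[: len(i1) // 2], i1[len(i1) // 2:]
    a2, b2 = i2[: len(i2) // 2], i2[len(i2) // 2:]

    # Screening half: choose supports for the separate and pooled models.
    act1 = screen(X1[a1], y1[a1], k)
    act2 = screen(X2[a2], y2[a2], k)
    act_pooled = screen(np.vstack([X1[a1], X2[a2]]),
                        np.concatenate([y1[a1], y2[a2]]), k)

    # Testing half: low dimensional likelihood ratio on the selected supports.
    X = np.vstack([X1[b1], X2[b2]])
    y = np.concatenate([y1[b1], y2[b2]])
    g = np.r_[np.zeros(len(b1), dtype=int), np.ones(len(b2), dtype=int)]
    obs = lr_stat(X, y, g, act1, act2, act_pooled)

    # Permute group labels on the testing half only; supports stay fixed.
    null = np.array([lr_stat(X, y, rng.permutation(g), act1, act2, act_pooled)
                     for _ in range(n_perm)])
    pval = (1 + np.sum(null >= obs)) / (1 + n_perm)
    return obs, pval
```

Because the supports are chosen on one half and held fixed, each permutation only requires refitting low dimensional models on the other half, which is what keeps a test of this kind cheap. Calling `split_perm_test(X1, y1, X2, y2)` with two `(n, p)` design matrices and response vectors returns the observed statistic and a permutation p-value.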
