Direction-Projection-Permutation for High-Dimensional Hypothesis Tests

High-dimensional, low sample size (HDLSS) data are becoming increasingly common in statistical applications. When the data can be partitioned into two classes, a basic task is to construct a classifier that assigns objects to the correct class. Binary linear classifiers have been shown to be especially useful in HDLSS settings and are often preferred to more complicated classifiers because they are easy to interpret. We propose a computational tool called direction-projection-permutation (DiProPerm), which rigorously assesses whether a binary linear classifier is detecting statistically significant differences between two high-dimensional distributions. The basic idea behind DiProPerm is to work directly with the one-dimensional projections of the data induced by a binary linear classifier. Theoretical properties of DiProPerm are studied under the HDLSS asymptotic regime, in which the dimension diverges to infinity while the sample size remains fixed. We show that certain variations of DiProPerm are consistent and that consistency is a nontrivial property of tests in the HDLSS asymptotic regime. The practical utility of DiProPerm is demonstrated on HDLSS gene expression microarray datasets. Finally, an empirical power study compares DiProPerm with several alternative two-sample HDLSS tests to understand the advantages and disadvantages of each method.
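As a rough illustration of the three steps the name suggests (direction, projection, permutation), the following is a minimal sketch and not the authors' implementation. It rests on several assumptions that the abstract does not state: the linear classifier is replaced by the simple mean-difference direction, the univariate statistic is the difference of projected class means, the direction is retrained on every relabeling, and the function names (`mean_difference_direction`, `projected_statistic`, `diproperm_pvalue`) are hypothetical.

```python
import random
from statistics import fmean


def mean_difference_direction(xs, ys):
    """Direction step (assumed classifier): unit vector from mean(ys) to mean(xs)."""
    dim = len(xs[0])
    diff = [fmean(r[j] for r in xs) - fmean(r[j] for r in ys) for j in range(dim)]
    norm = sum(d * d for d in diff) ** 0.5
    # If both class means coincide, return the zero vector unchanged.
    return [d / norm for d in diff] if norm > 0 else diff


def projected_statistic(xs, ys):
    """Projection step: project onto the trained direction, compare projected means."""
    w = mean_difference_direction(xs, ys)

    def project(row):
        return sum(a * b for a, b in zip(row, w))

    return fmean(project(r) for r in xs) - fmean(project(r) for r in ys)


def diproperm_pvalue(xs, ys, n_perm=200, seed=0):
    """Permutation step: relabel, retrain the direction, recompute the statistic."""
    rng = random.Random(seed)
    observed = projected_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    n = len(xs)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random reassignment of class labels
        if projected_statistic(pooled[:n], pooled[n:]) >= observed:
            exceed += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Calling `diproperm_pvalue(xs, ys)` with two lists of equal-length numeric rows returns an estimated permutation p-value. Retraining the direction inside the loop matters in this sketch: in high dimensions a linear direction can separate almost any labeling, so the observed separation is only meaningful relative to the separation achieved on permuted labels.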
