A two-sample test for the equality of univariate marginal distributions for high-dimensional data

Abstract. A recurring theme in modern statistics is the analysis of high-dimensional data, whose main feature is a large number, p, of variables but a small sample size. In this context, our aim is to test the null hypothesis that the marginal distributions of the p variables are the same for two groups. We propose a test statistic motivated by the simple idea of comparing, for each of the p variables, the empirical characteristic functions computed from the two samples. The asymptotic normality of the test statistic is derived under mixing conditions. In our asymptotic analysis, the number of variables tends to infinity while the sizes of the individual samples remain fixed. To obtain a practical test, several estimators of the variance are proposed, leading to three somewhat different versions of the test. An alternative global test based on the P-values derived from permutation tests is also proposed. A simulation study is carried out to investigate the finite-sample properties of the proposed tests, and a practical illustration involving microarray data is provided.
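The idea of comparing empirical characteristic functions (ECFs) variable by variable can be illustrated with a minimal Python sketch. It is not the statistic or the calibration defined in the paper. It assumes a standard normal weight function, under which the weighted squared distance between two ECFs has a closed form in terms of exp(-d^2/2) evaluated at pairwise differences d. It also calibrates the average per-variable distance by permuting group labels, which is a simplification and differs from both the asymptotic-normality tests and the P-value-based global test mentioned in the abstract. The function names `ecf_distance`, `per_variable_distances`, and `permutation_pvalue` are hypothetical.

```python
import numpy as np


def ecf_distance(x, y):
    """Weighted squared L2 distance between the ECFs of two 1-D samples.

    Assumption: the weight is the N(0, 1) density, so the integral of
    cos(t * d) against the weight equals exp(-d**2 / 2).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def kernel_mean(a, b):
        # Mean of exp(-(a_i - b_j)^2 / 2) over all pairs (i, j).
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2).mean()

    return kernel_mean(x, x) + kernel_mean(y, y) - 2.0 * kernel_mean(x, y)


def per_variable_distances(X, Y):
    """ECF distance for each of the p columns of X (n x p) and Y (m x p)."""
    return np.array([ecf_distance(X[:, j], Y[:, j]) for j in range(X.shape[1])])


def permutation_pvalue(X, Y, n_perm=200, seed=0):
    """Permutation p-value for the average per-variable ECF distance.

    Whole rows (subjects) are permuted, so any dependence between the
    variables is kept intact within each permuted data set.
    """
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    n = X.shape[0]
    observed = per_variable_distances(X, Y).mean()
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(Z.shape[0])
        stat = per_variable_distances(Z[idx[:n]], Z[idx[n:]]).mean()
        exceed += int(stat >= observed)
    # Add-one correction keeps the p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(6, 100))           # group 1: 6 subjects, 100 variables
    Y = rng.normal(loc=1.0, size=(6, 100))  # group 2: shifted marginals
    print(permutation_pvalue(X, Y))
```

Permuting entire rows rather than individual entries is a deliberate choice in this sketch: it leaves the dependence structure across variables untouched, which matters when the p variables are correlated.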
