Nonparametric Heterogeneity Testing For Massive Data

A massive dataset often consists of a growing number of (potentially) heterogeneous sub-populations. This paper is concerned with testing various forms of heterogeneity arising from massive data. In a general nonparametric framework, a set of testing procedures is designed to accommodate a growing number of sub-populations, denoted by $s$, while remaining computationally feasible. In theory, their null limit distributions are derived as being nearly Chi-square with diverging degrees of freedom, as long as $s$ does not grow too fast. Interestingly, we find that a lower bound on $s$ needs to be set to obtain a sufficiently powerful test, the so-called "blessing of aggregation." As a by-product, a type of homogeneity test is also proposed, with a test statistic aggregated over all sub-populations. Numerical results are presented to support our theory.
