A Massive Data Framework for M-Estimators with Cubic-Rate

ABSTRACT The divide-and-conquer method is a common strategy for handling massive data. In this article, we study the divide-and-conquer method for cubic-rate estimators under the massive data framework. We develop a general theory for establishing the asymptotic distribution of aggregated M-estimators formed as a weighted average whose weights depend on the subgroup sample sizes. Under a condition on the growth rate of the number of subgroups, the resulting aggregated estimators are shown to have a faster convergence rate and an asymptotically normal distribution, making them more tractable in both computation and inference than the original M-estimators based on the pooled data. Our theory applies to a wide class of M-estimators with cube-root convergence rate, including the location estimator, the maximum score estimator, and the value search estimator. Simulations and a real data application also validate our theoretical findings. Supplementary materials for this article are available online.
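A minimal sketch of the divide-and-conquer aggregation described above: the pooled sample is partitioned into subgroups, an M-estimator is computed on each subgroup, and the subgroup estimates are combined by a weighted average whose weights depend on the subgroup sample sizes. The weight choice (proportional to subgroup size) and the shorth-style location estimator used for illustration are assumptions for this sketch, not the paper's exact specification.

```python
import numpy as np

def dc_aggregate(data, n_groups, m_estimator, weight_fn=None):
    """Divide-and-conquer aggregation of subgroup M-estimators.

    data        : 1-D array of pooled observations
    n_groups    : number of subgroups K
    m_estimator : function mapping a subsample to a scalar estimate
    weight_fn   : maps a subgroup size n_k to its weight; weights
                  proportional to n_k are an illustrative default,
                  not the paper's prescribed form.
    """
    if weight_fn is None:
        weight_fn = lambda n_k: n_k
    splits = np.array_split(data, n_groups)            # partition the massive sample
    estimates = np.array([m_estimator(s) for s in splits])
    weights = np.array([weight_fn(len(s)) for s in splits], dtype=float)
    weights /= weights.sum()                           # normalize the weights
    return np.dot(weights, estimates)                  # weighted average of subgroup estimates

# Illustrative cube-root-rate location estimator: midpoint of the
# shortest interval containing half of the observations.
def shorth_midpoint(x):
    x = np.sort(x)
    h = len(x) // 2
    widths = x[h:] - x[:len(x) - h]
    i = np.argmin(widths)
    return 0.5 * (x[i] + x[i + h])

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=1.0, size=100_000)
print(dc_aggregate(sample, n_groups=50, m_estimator=shorth_midpoint))
```

In this hypothetical setup, each subgroup estimate converges at the cube-root rate, while the aggregated estimate averages out the subgroup-level fluctuations, reflecting the faster rate and asymptotic normality established in the paper.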
