Scalable subsampling: computation, aggregation and inference

Subsampling is a general statistical method developed in the 1990s aimed at estimating the sampling distribution of a statistic θ̂_n in order to conduct nonparametric inference such as the construction of confidence intervals and hypothesis tests. Subsampling has seen a resurgence in the Big Data era where the standard, full-resample size bootstrap can be infeasible to compute. Nevertheless, even choosing a single random subsample of size b can be computationally challenging with both b and the sample size n being very large. In the paper at hand, we show how a set of appropriately chosen, non-random subsamples can be used to conduct effective—and computationally feasible—distribution estimation via subsampling. Further, we show how the same set of subsamples can be used to yield a procedure for subsampling aggregation—also known as subagging—that is scalable with big data. Interestingly, the scalable subagging estimator can be tuned to have the same (or better) rate of convergence as compared to θ̂_n. The paper is concluded by showing how to conduct inference, e.g., confidence intervals, based on the scalable subagging estimator instead of the original θ̂_n.
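The core idea above—replacing one expensive computation of θ̂_n on all n observations with cheap computations of the statistic on a fixed collection of subsamples, then aggregating—can be illustrated with a minimal sketch. The sketch below is not the authors' exact procedure; it assumes the simplest non-random design, namely q = n // b disjoint consecutive blocks of size b, and the function name `scalable_subagging` is illustrative.

```python
import numpy as np

def scalable_subagging(data, statistic, b):
    """Subagging over non-random, non-overlapping subsamples.

    Partitions the first q*b observations (q = n // b) into q disjoint
    blocks of size b, evaluates `statistic` on each block, and returns
    the average together with the per-block estimates. The cost is q
    evaluations on size-b data rather than one evaluation on size-n
    data, which is what makes the scheme scalable.
    """
    n = len(data)
    q = n // b
    estimates = np.array([statistic(data[i * b:(i + 1) * b])
                          for i in range(q)])
    return estimates.mean(), estimates

# Illustration: subagging the sample mean of one million observations.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=1_000_000)
theta_bag, thetas = scalable_subagging(x, np.mean, b=10_000)
```

The per-block estimates `thetas` are also the raw material for subsampling inference: their empirical spread around θ̂_n (after suitable rate scaling) estimates the sampling distribution needed for confidence intervals.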
