Robust, Scalable, and Fast Bootstrap Method for Analyzing Large Scale Data

In this paper we address the problem of performing statistical inference for large-scale data sets, i.e., Big Data. The volume and dimensionality of the data may be so high that the data cannot be processed or stored on a single computing node. We propose a scalable, statistically robust, and computationally efficient bootstrap method that is compatible with distributed processing and storage systems. Bootstrap resamples are constructed with a smaller number of distinct data points on multiple disjoint subsets of the data, similarly to the bag of little bootstraps (BLB) method [A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “A scalable bootstrap for massive data,” J. Roy. Statist. Soc.: Ser. B (Statist. Methodol.), vol. 76, no. 4, pp. 795-816, 2014]. The disjoint subsets are significantly smaller than the original full data set, and they may be processed in parallel on different storage and computing units. Significant savings in computation are then achieved by avoiding the recomputation of the estimator for each bootstrap sample. Instead, a computationally efficient fixed-point estimation equation is solved analytically via a smart approximation, following the fast and robust bootstrap (FRB) method [M. Salibián-Barrera, S. Van Aelst, and G. Willems, “Fast and robust bootstrap,” Statist. Methods Appl., vol. 17, no. 1, pp. 41-71, 2008]. The proposed bootstrap method facilitates the use of highly robust statistical methods in analyzing large-scale data sets. The favorable statistical properties of the method are established analytically. Numerical examples demonstrate the scalability, low complexity, and robust statistical performance of the method in analyzing large data sets.
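The combination described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a Huber M-estimator of location with the scale fixed at the subset MAD, as a simple stand-in for the highly robust estimators the paper targets. The function names, the tuning constant c = 1.345, and the subset size b = n^0.7 are all hypothetical choices made for illustration. Each disjoint subset of b points is resampled with multinomial counts that sum to the full size n (the BLB step). Each bootstrap replicate is then obtained from one reweighted evaluation of the fixed-point map plus a linear correction, rather than by re-solving the estimating equation (the FRB step).

```python
import numpy as np


def huber_weights(r, c=1.345):
    """Huber weights w(r) = psi(r) / r = min(1, c / |r|)."""
    a = np.abs(r)
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))


def huber_location(x, scale, c=1.345, tol=1e-10, max_iter=200):
    """Solve theta = sum(w_i x_i) / sum(w_i) by fixed-point iteration."""
    theta = np.median(x)
    for _ in range(max_iter):
        w = huber_weights((x - theta) / scale, c)
        new_theta = np.sum(w * x) / np.sum(w)
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    return theta


def blb_frb_std_error(x, n_subsets=5, subset_size=None, n_boot=200,
                      c=1.345, seed=0):
    """Standard error of the Huber location estimate (illustrative sketch).

    BLB step: disjoint subsets of size b, multinomial counts summing to n.
    FRB step: one-step linear correction instead of re-solving per resample.
    """
    rng = np.random.default_rng(seed)
    n = x.size
    b = subset_size or int(n ** 0.7)  # assumed subset size, not from the paper
    if n_subsets * b > n:
        raise ValueError("not enough data for the requested disjoint subsets")
    perm = rng.permutation(n)
    std_errors = []
    for k in range(n_subsets):
        xs = x[perm[k * b:(k + 1) * b]]
        # Scale is held fixed at the subset MAD (a simplifying assumption).
        scale = 1.4826 * np.median(np.abs(xs - np.median(xs)))
        theta = huber_location(xs, scale, c)
        r = (xs - theta) / scale
        w = huber_weights(r, c)
        psi_prime = (np.abs(r) <= c).astype(float)
        # Linear correction [1 - g'(theta)]^{-1} = sum(w) / sum(psi').
        correction = w.sum() / psi_prime.sum()
        # Each row holds the counts of one resample of size n on b points.
        counts = rng.multinomial(n, np.full(b, 1.0 / b), size=n_boot)
        g_star = (counts * (w * xs)).sum(axis=1) / (counts * w).sum(axis=1)
        theta_star = theta + correction * (g_star - theta)
        std_errors.append(theta_star.std(ddof=1))
    # BLB averages the per-subset assessments of estimator quality.
    return float(np.mean(std_errors))
```

In this sketch the weights are computed once per subset at the subset estimate, so outlying points keep their small weights in every resample. Each loop iteration touches only its own subset, so the subsets could be processed on separate nodes.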
