A Selective Review on Statistical Techniques for Big Data

To meet the challenges of big data, many new statistical tools have been developed in recent years. In this review, we summarize some of these approaches to give an overview of the current state of development. We focus on the case in which the number of observations is much larger than the dimension of the unknown parameters, although we also mention some investigations related to high-dimensional data. We discuss methods that use subsamples as well as methods that process the whole data set piece by piece.
