Efficient subsampling for exponential family models

We propose a novel two-stage subsampling algorithm based on optimal design principles. In the first stage, we apply a density-based clustering algorithm to an initial subsample to identify an approximating design space for the predictors. Next, we determine an optimal approximate design on this design space. Finally, we use matrix distances such as the Procrustes, Frobenius, and square-root distance to select the remaining subsample, so that its points are "closest" to the support points of the optimal design. Our approach reflects the specific structure of the information matrix as a weighted sum of non-negative definite Fisher information matrices evaluated at the design points, and it applies to a large class of regression models, including models where the Fisher information has rank larger than $1$.
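As a rough illustration of the three steps, the following Python sketch combines them under strong simplifying assumptions: a linear model, so the Fisher information at a point $x$ is the rank-one matrix $x x^\top$; the Frobenius distance as the matrix distance; uniform design weights standing in for a genuinely optimal approximate design; and scikit-learn's DBSCAN for the density-based clustering step. All function and parameter names (`two_stage_subsample`, `n_init`, `n_total`) are hypothetical and not taken from the paper.

```python
# Minimal sketch of a two-stage, design-based subsampling scheme.
# Assumptions (not from the paper): linear model, Frobenius distance,
# uniform design weights as a stand-in for an optimal approximate design.
import numpy as np
from sklearn.cluster import DBSCAN


def frobenius_dist(A, B):
    """Frobenius distance between two information matrices."""
    return np.linalg.norm(A - B, ord="fro")


def two_stage_subsample(X, n_init=1000, n_total=5000,
                        eps=0.5, min_samples=10, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]

    # Stage 1: take a uniform initial subsample and cluster it with DBSCAN;
    # the cluster centroids serve as the approximating design space.
    init_idx = rng.choice(n, size=n_init, replace=False)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[init_idx])
    centroids = np.array([X[init_idx][labels == k].mean(axis=0)
                          for k in np.unique(labels) if k != -1])

    # Stage 2 (placeholder): an optimal approximate design would assign
    # weights w_k to the candidate support points; uniform weights here.
    weights = np.full(len(centroids), 1.0 / len(centroids))

    # Stage 3: spend the remaining budget per support point in proportion
    # to its weight, choosing the data points whose rank-one information
    # matrices x x^T are closest in Frobenius distance to that of the
    # support point.
    info = lambda x: np.outer(x, x)  # Fisher information for a linear model
    chosen = list(init_idx)
    for w, c in zip(weights, centroids):
        m = int(round(w * (n_total - n_init)))
        d = np.array([frobenius_dist(info(x), info(c)) for x in X])
        chosen.extend(np.argsort(d)[:m])
    return np.unique(chosen)
```

In an actual implementation, Stage 2 would replace the uniform weights with a computed optimal design (e.g., a D-optimal design on the centroids), and the Frobenius distance could be swapped for the Procrustes or square-root distance; the sketch only fixes the overall control flow.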
