Statistical properties of sketching algorithms

Sketching is a probabilistic data compression technique that has been developed largely in the computer science community. Numerical operations on big datasets can be intolerably slow; sketching algorithms address this issue by generating a smaller surrogate dataset, and inference typically proceeds on the compressed dataset. Sketching algorithms generally use random projections to compress the original dataset, and this stochastic generation process makes them amenable to statistical analysis. We argue that the sketched data can be modelled as a random sample, thus placing this family of data compression methods firmly within an inferential framework. In particular, we focus on the Gaussian, Hadamard and Clarkson-Woodruff sketches, and their use in single-pass sketching algorithms for linear regression with huge $n$. We explore the statistical properties of sketched regression algorithms and derive new distributional results for a large class of sketched estimators. A key result is a conditional central limit theorem for data-oblivious sketches. An important finding is that the best choice of sketching algorithm, in terms of mean squared error, is related to the signal-to-noise ratio in the source dataset. Finally, we demonstrate the theory and the limits of its applicability on two real datasets.
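As a rough illustration of the single-pass idea, the following sketch compresses a simulated regression dataset from $n$ rows to $k$ rows and fits least squares on the compressed data. It is a minimal example under stated assumptions, not the authors' implementation: the function names, the simulated data, and the sizes `n`, `p`, `k` are all hypothetical, NumPy is assumed to be available, and only the Gaussian and Clarkson-Woodruff sketches are shown (the Hadamard sketch is omitted for brevity).

```python
import numpy as np

def gaussian_sketch(X, y, k, rng):
    """Project the n rows onto k rows using i.i.d. N(0, 1/k) entries."""
    n = X.shape[0]
    S = rng.normal(scale=1.0 / np.sqrt(k), size=(k, n))
    return S @ X, S @ y

def clarkson_woodruff_sketch(X, y, k, rng):
    """Hash each row to one of k buckets, flip its sign at random, and sum per bucket."""
    n = X.shape[0]
    buckets = rng.integers(0, k, size=n)       # bucket index for every row
    signs = rng.choice([-1.0, 1.0], size=n)    # random sign for every row
    X_s = np.zeros((k, X.shape[1]))
    y_s = np.zeros(k)
    np.add.at(X_s, buckets, signs[:, None] * X)  # unbuffered accumulation per bucket
    np.add.at(y_s, buckets, signs * y)
    return X_s, y_s

def ols(X, y):
    """Ordinary least squares coefficients."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Simulated data (assumed for illustration only).
rng = np.random.default_rng(0)
n, p, k = 5_000, 5, 500
X = rng.normal(size=(n, p))
beta = np.arange(1.0, p + 1.0)
y = X @ beta + rng.normal(size=n)

beta_full = ols(X, y)                                   # full-data estimate
beta_gauss = ols(*gaussian_sketch(X, y, k, rng))        # Gaussian sketch estimate
beta_cw = ols(*clarkson_woodruff_sketch(X, y, k, rng))  # Clarkson-Woodruff estimate

print("full data:        ", np.round(beta_full, 3))
print("Gaussian sketch:  ", np.round(beta_gauss, 3))
print("Clarkson-Woodruff:", np.round(beta_cw, 3))
```

Both sketches are data oblivious in the sense that the random projection is drawn without looking at `X` or `y`. The Clarkson-Woodruff version never forms the $k \times n$ projection matrix, so its cost scales with the number of nonzero entries in the data, whereas the dense Gaussian projection is used here only because it is the simplest to write down.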
