New Subsampling Algorithms for Fast Least Squares Regression

We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods that solve the big-data problem by subsampling the covariance matrix, using either single- or two-stage estimation. All three run in time linear in the input size, i.e., O(np), and our best method, Uluru, gives an error bound of O(√p/n) that is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed-design setting (with Randomized Hadamard preconditioning) as well as the sub-Gaussian random-design setting. We also compare the performance of our methods on synthetic and real-world datasets, and show that if the observations are i.i.d. sub-Gaussian, one can subsample directly, skipping the expensive Randomized Hadamard preconditioning, without loss of accuracy.
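To make the subsampling idea concrete, here is a minimal NumPy sketch of a single-stage subsampled OLS fit and a two-stage variant in the spirit of Uluru. The function names, the uniform subsampling, and the exact form of the two-stage correction (a one-step update that uses the subsampled covariance as a surrogate Hessian) are illustrative assumptions, not a verbatim transcription of the paper's algorithms; they only show why the cost stays at a single O(np) pass over the full data.

```python
import numpy as np

def subsampled_ols(X, y, n_sub, seed=None):
    """Single-stage estimator: fit OLS on a uniform random subsample.

    A minimal sketch of the single-stage idea from the abstract; the
    exact estimators are defined in the paper body.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    idx = rng.choice(n, size=n_sub, replace=False)
    # Solving the subsampled least-squares problem costs O(n_sub * p^2).
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta

def uluru_sketch(X, y, n_sub, seed=None):
    """Two-stage sketch: subsample, then correct with full-data residuals.

    The correction term below is an illustrative assumption: a one-step
    update using the subsampled covariance (X_s^T X_s / n_sub) as a
    cheap surrogate for the full covariance.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    idx = rng.choice(n, size=n_sub, replace=False)
    Xs, ys = X[idx], y[idx]
    # Stage 1: pilot estimate from the subsample.
    beta_s, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    # Stage 2: residuals on the FULL data -- one O(np) pass.
    r = y - X @ beta_s
    # Correct using the subsampled p x p covariance; solving is O(p^3).
    correction = np.linalg.solve(Xs.T @ Xs / n_sub, X.T @ r / n)
    return beta_s + correction
```

Both estimators avoid ever forming the full n × p normal equations: the expensive p × p solve touches only the subsample, and the full data is visited once for the residual pass.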
