New Subsampling Algorithms for Fast Least Squares Regression

We address the problem of fast estimation of ordinary least squares (OLS) from large amounts of data (n ≫ p). We propose three methods that solve the big-data problem by subsampling the covariance matrix, using either single- or two-stage estimation. All three run in time linear in the input size, i.e., O(np), and our best method, Uluru, gives an error bound of O(√p/n) that is independent of the amount of subsampling as long as it is above a threshold. We provide theoretical bounds for our algorithms in the fixed-design setting (with Randomized Hadamard preconditioning) as well as the sub-Gaussian random-design setting. We also compare the performance of our methods on synthetic and real-world datasets, and show that if the observations are i.i.d. sub-Gaussian, one can subsample directly, skipping the expensive Randomized Hadamard preconditioning, without loss of accuracy.
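To make the subsampling idea concrete, here is a minimal NumPy sketch of a single-stage subsampled OLS fit and a two-stage variant in the spirit of Uluru. The function names, the uniform subsampling, and the exact form of the two-stage correction (a one-step update that uses the subsampled covariance as a surrogate Hessian) are illustrative assumptions, not a verbatim transcription of the paper's algorithms; they only show why the cost stays at a single O(np) pass over the full data.

```python
import numpy as np

def subsampled_ols(X, y, n_sub, seed=None):
    """Single-stage estimator: fit OLS on a uniform random subsample.

    A minimal sketch of the single-stage idea from the abstract; the
    exact estimators are defined in the paper body.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    idx = rng.choice(n, size=n_sub, replace=False)
    # Solving the subsampled least-squares problem costs O(n_sub * p^2).
    beta, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
    return beta

def uluru_sketch(X, y, n_sub, seed=None):
    """Two-stage sketch: subsample, then correct with full-data residuals.

    The correction term below is an illustrative assumption: a one-step
    update using the subsampled covariance (X_s^T X_s / n_sub) as a
    cheap surrogate for the full covariance.
    """
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    idx = rng.choice(n, size=n_sub, replace=False)
    Xs, ys = X[idx], y[idx]
    # Stage 1: pilot estimate from the subsample.
    beta_s, *_ = np.linalg.lstsq(Xs, ys, rcond=None)
    # Stage 2: residuals on the FULL data -- one O(np) pass.
    r = y - X @ beta_s
    # Correct using the subsampled p x p covariance; solving is O(p^3).
    correction = np.linalg.solve(Xs.T @ Xs / n_sub, X.T @ r / n)
    return beta_s + correction
```

Both estimators avoid ever forming the full n × p normal equations: the expensive p × p solve touches only the subsample, and the full data is visited once for the residual pass.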
