Lazy Stochastic Principal Component Analysis

Stochastic principal component analysis (SPCA) has become a popular dimensionality-reduction strategy for large, high-dimensional datasets. We derive a simplified algorithm, called Lazy SPCA, which has reduced computational complexity and is better suited to large-scale distributed computation. We prove that SPCA and Lazy SPCA find the same approximation to the principal subspace, and that the pairwise distances between samples in the lower-dimensional space are invariant to whether SPCA is executed lazily or not. Empirical studies find downstream predictive performance to be identical for both methods, and superior to random projections, across a range of predictive models (linear regression, logistic lasso, and random forests). In our largest experiment, with 4.6 million samples, Lazy SPCA reduced 43.7 hours of computation to 9.9 hours. Overall, Lazy SPCA relies exclusively on matrix multiplications, apart from a single operation on a small square matrix whose size depends only on the target dimensionality.
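
For intuition, below is a minimal NumPy sketch of the computational pattern the abstract describes: sketch the data with a random test matrix, refine with unorthogonalized power iterations, and defer all decomposition work to a single eigendecomposition of a small k-by-k Gram matrix. The function name `lazy_spca_sketch` and the parameters `n_iter` and `seed` are illustrative assumptions; this is a simplified stand-in for the general idea, not the paper's exact algorithm.

```python
import numpy as np

def lazy_spca_sketch(X, k, n_iter=2, seed=0):
    """Illustrative lazy randomized-PCA pipeline (assumed sketch).

    All heavy steps are matrix multiplications; the only
    decomposition is of a k x k matrix, whose size depends
    only on the target dimensionality k.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]

    # Random test matrix and initial sketch: one large matmul.
    Omega = rng.standard_normal((d, k))
    Y = X @ Omega                      # n x k

    # Power iterations without intermediate QR; adequate for a
    # few iterations, though floating-point accuracy degrades
    # if n_iter grows large without re-orthogonalization.
    for _ in range(n_iter):
        Y = X @ (X.T @ Y)              # still only matmuls

    # The single non-matmul operation: eigendecomposition of
    # the small k x k Gram matrix of the sketch.
    G = Y.T @ Y
    vals, V = np.linalg.eigh(G)        # ascending eigenvalues
    order = np.argsort(vals)[::-1]     # reorder to descending

    # Rotate the sketch into its eigenbasis. V is orthogonal,
    # so pairwise distances between rows of Y are preserved.
    return Y @ V[:, order]
```

Because the final rotation is orthogonal, any downstream model that depends only on Euclidean distances or inner products between samples sees the same geometry whether or not that last step is applied, which mirrors the invariance property claimed above.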
