Communication-optimal Distributed Principal Component Analysis in the Column-partition Model

We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in \mathbb{R}^{m \times n}$, a rank parameter $k < \operatorname{rank}(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m \times k$ matrix $U$ with orthonormal columns for which $$ \| A - U U^T A \|_F^2 \le \left(1 + \epsilon \right) \cdot \| A - A_k \|_F^2, $$ where $A_k \in \mathbb{R}^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA.
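
To make the guarantee concrete, the following is a minimal NumPy sketch of how one might check it on a small dense matrix. It takes the left-hand side to be the projection error $\| A - U U^T A \|_F^2$, the reading under which the dimensions match. The helper names (`pca_error`, `best_rank_k_error`, `satisfies_guarantee`) and the test matrix are illustrative assumptions and do not come from the paper; the sketch uses a centralized SVD purely as a reference point and is not the distributed or streaming algorithm itself.

```python
import numpy as np

def pca_error(A, U):
    """Squared Frobenius error of projecting A onto the column span of U."""
    return np.linalg.norm(A - U @ (U.T @ A), "fro") ** 2

def best_rank_k_error(A, k):
    """||A - A_k||_F^2, i.e. the sum of squared singular values beyond the k-th."""
    s = np.linalg.svd(A, compute_uv=False)
    return float(np.sum(s[k:] ** 2))

def satisfies_guarantee(A, U, k, eps):
    """Check the (1 + eps) relative-error condition for a candidate U."""
    return pca_error(A, U) <= (1 + eps) * best_rank_k_error(A, k)

# Hypothetical test instance: a random 50 x 30 Gaussian matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 30))
k, eps = 5, 0.1

# Top-k left singular vectors attain the optimum ||A - A_k||_F^2.
U_opt = np.linalg.svd(A, full_matrices=False)[0][:, :k]

# An arbitrary orthonormal basis, for comparison (it can do no better than U_opt).
U_rand = np.linalg.qr(rng.standard_normal((50, k)))[0]

print(satisfies_guarantee(A, U_opt, k, eps))
print(pca_error(A, U_opt), pca_error(A, U_rand))
```

Any $U$ with orthonormal columns gives an error at least $\| A - A_k \|_F^2$, so the condition asks for a subspace whose error is within a $(1 + \epsilon)$ factor of the optimum.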
