Paper Information - Communication-optimal Distributed Principal Component Analysis in the Column-partition Model

Communication-optimal Distributed Principal Component Analysis in the Column-partition Model

We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in \mathbb{R}^{m \times n}$, a rank parameter $k < \mathrm{rank}(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m \times k$ matrix $U$ with orthonormal columns for which $$ \| A - U U^T A \|_F^2 \le \left(1 + \epsilon \right) \cdot \| A - A_k \|_F^2, $$ where $A_k \in \mathbb{R}^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA.
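To make the guarantee concrete, the following is a minimal sketch, not taken from the paper, that evaluates the projection error $\| A - U U^T A \|_F^2$ for a candidate $U$ and compares it with the optimal rank-$k$ error. The toy matrix, the helper names (`frob_sq`, `matmul`, `transpose`, `projection_error`), and the choices $k = 1$, $\epsilon = 0.05$ are all illustrative assumptions. The matrix is chosen so that its singular values (3, 2, 1) can be read off directly and no SVD routine is needed; real code would more likely use a numerical library.

```python
import math

def frob_sq(M):
    """Squared Frobenius norm of a matrix stored as a list of rows."""
    return sum(x * x for row in M for x in row)

def matmul(X, Y):
    """Plain triple-loop matrix product X @ Y."""
    return [[sum(X[i][t] * Y[t][j] for t in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def transpose(M):
    return [list(col) for col in zip(*M)]

def projection_error(A, U):
    """Return ||A - U U^T A||_F^2, assuming U has orthonormal columns."""
    projected = matmul(U, matmul(transpose(U), A))
    residual = [[a - p for a, p in zip(row_a, row_p)]
                for row_a, row_p in zip(A, projected)]
    return frob_sq(residual)

# Hypothetical 3 x 4 input whose singular values are 3, 2, 1 by construction.
A = [[3.0, 0.0, 0.0, 0.0],
     [0.0, 2.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0]]
k = 1
eps = 0.05

# Optimal rank-1 error: sum of the squared trailing singular values, 2^2 + 1^2.
opt_error = 2.0 ** 2 + 1.0 ** 2

# Exact top singular direction, and a slightly rotated (approximate) one.
U_exact = [[1.0], [0.0], [0.0]]
theta = 0.1
U_approx = [[math.cos(theta)], [math.sin(theta)], [0.0]]

err_exact = projection_error(A, U_exact)
err_approx = projection_error(A, U_approx)

print("optimal error:    ", opt_error)
print("exact U error:    ", err_exact)
print("approximate U err:", err_approx)
print("within (1 + eps)? ", err_approx <= (1 + eps) * opt_error)
```

The rotated $U$ is not the true top singular vector, yet its error stays within the $(1 + \epsilon)$ factor of the optimum. This is the kind of output the problem statement accepts.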

Christos Boutsidis | David P. Woodruff
