Dimensionality Reduction for k-Means Clustering and Low Rank Approximation

We show how to approximate a data matrix A with a much smaller sketch Ã that can be used to solve a general class of constrained rank-k approximation problems to within (1+ε) error. Importantly, this class includes k-means clustering and unconstrained low rank approximation (i.e., principal component analysis). By reducing data points to just O(k) dimensions, we generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For k-means dimensionality reduction, we provide (1+ε) relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for A, but can be used directly to compute this subspace. Finally, for k-means clustering, we show how to achieve a (9+ε) approximation by Johnson-Lindenstrauss projecting data to just O(log k / ε²) dimensions. This is the first result that leverages the specific structure of k-means to achieve dimension independent of input size and sublinear in k.
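
As a concrete illustration of the Johnson-Lindenstrauss approach described above, the following is a minimal sketch, not the paper's exact algorithm: it projects the data to O(log k / ε²) dimensions with a dense Gaussian random matrix, clusters the sketch, and evaluates the induced clustering on the original points. The data shape, the number of clusters, the accuracy parameter ε, and the constant 10 in the target dimension are illustrative assumptions, and scikit-learn's KMeans stands in for "any exact, approximate, or heuristic k-means algorithm."

```python
# Illustrative sketch (assumed parameters throughout): JL dimensionality
# reduction for k-means, in the spirit of the abstract above.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

n, d = 10_000, 1_000   # number of points and original dimension (assumed)
k, eps = 10, 0.5       # number of clusters and accuracy parameter (assumed)

A = rng.standard_normal((n, d))          # data matrix: one point per row

# Target dimension m = O(log k / eps^2); the constant 10 is illustrative.
m = int(np.ceil(10 * np.log(k) / eps**2))

# Dense Gaussian JL projection, scaled so squared norms are preserved
# in expectation.
Pi = rng.standard_normal((d, m)) / np.sqrt(m)
A_sketch = A @ Pi                        # n x m sketch of A

# Run any off-the-shelf k-means solver on the low-dimensional sketch.
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_sketch)

# Evaluate the k-means cost of the induced partition on the ORIGINAL data:
# sum of squared distances from each point to its cluster's centroid.
cost = sum(
    np.sum((A[labels == c] - A[labels == c].mean(axis=0)) ** 2)
    for c in range(k)
)
print(f"sketch dimension m = {m}, k-means cost on original data = {cost:.2e}")
```

The design choice here is that clustering cost is always measured in the original space: the projection only determines the partition, so any improvement in the solver's quality on the sketch translates directly to the high-dimensional objective.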
