Faster Coreset Construction for Projective Clustering via Low-Rank Approximation

In this work, we present a randomized coreset construction for projective clustering, the problem of finding $k$ $j$-dimensional linear (or affine) subspaces that are closest to a given set of $n$ vectors in $d$ dimensions. Let $A \in \mathbb{R}^{n\times d}$ be the input matrix. An earlier deterministic coreset construction of Feldman \textit{et al.} relies on computing the SVD of $A$; the best known algorithms for an exact SVD require $O(\min\{nd^2, n^2d\})$ time, which may be infeasible for large $n$ and $d$. Instead, we construct a coreset by projecting the rows of $A$ onto a small set of orthonormal vectors that closely approximate the right singular vectors of $A$. As a consequence, when $k$ and $j$ are small, our algorithm is faster than that of Feldman \textit{et al.} while providing almost the same approximation guarantee. Our construction also uses less space and can exploit the sparsity of the input. Moreover, the coreset can be maintained efficiently in a streaming setting.
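As a rough illustration of this idea (not the construction analyzed in the paper), the sketch below shows how the rows of $A$ can be projected onto an orthonormal set approximating its right singular vectors using a randomized sketch of the row space, avoiding an exact SVD. The function names, the Gaussian sketching matrix, the oversampling parameter, and the choice of target dimension $m$ are illustrative assumptions of ours and are not taken from the paper.

\begin{verbatim}
import numpy as np

def approx_right_singular_basis(A, m, oversample=10, seed=0):
    """Orthonormal d x m basis whose span approximates the span of the
    top-m right singular vectors of A, via a Gaussian row-space sketch."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    S = rng.standard_normal((m + oversample, n))  # Gaussian sketching matrix
    Y = S @ A                                     # (m + oversample) x d sketch of the row space
    Q, _ = np.linalg.qr(Y.T)                      # orthonormal basis for the sketch's row space
    return Q[:, :m]                               # d x m, orthonormal columns

def project_rows(A, V):
    """Project every row of A onto span(V)."""
    return (A @ V) @ V.T

# Toy usage: compare the sketched projection with the exact rank-m projection.
rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 50))
m = 10                                            # illustrative target dimension
V = approx_right_singular_basis(A, m)
A_sketch = project_rows(A, V)

_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_exact = (A @ Vt[:m].T) @ Vt[:m]                 # best rank-m projection via exact SVD
print(np.linalg.norm(A - A_sketch, "fro"),
      np.linalg.norm(A - A_exact, "fro"))
\end{verbatim}

This is only meant to show the shape of the computation; the paper's guarantees depend on how the approximate basis is obtained and on the target dimension, which the toy sketch above does not reproduce.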

[1] Tamás Sarlós, et al. Improved Approximation Algorithms for Large Matrices via Random Projections, 2006, 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[2] Michael W. Mahoney. Randomized Algorithms for Matrices and Data, 2011, Found. Trends Mach. Learn.

[3] Sharad Mehrotra, et al. Local Dimensionality Reduction: A New Approach to Indexing High Dimensional Spaces, 2000, VLDB.

[4] Xin Xiao, et al. A near-linear algorithm for projective clustering integer points, 2012, SODA.

[5] Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[6] S. P. Lloyd. Least squares quantization in PCM, 1982, IEEE Trans. Inf. Theory.

[7] Kasturi R. Varadarajan, et al. Geometric Approximation via Coresets, 2007.

[8] Pankaj K. Agarwal, et al. Approximating extent measures of points, 2004, JACM.

[9] Michael B. Cohen, et al. Dimensionality Reduction for k-Means Clustering and Low Rank Approximation, 2014, STOC.

[10] Prabhakar Raghavan, et al. Computing on data streams, 1999, External Memory Algorithms.

[11] Hans-Peter Kriegel, et al. A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, 1996, KDD.

[12] Jeff M. Phillips. Coresets and Sketches, 2016, ArXiv.

[13] Tian Zhang, et al. BIRCH: A New Data Clustering Algorithm and Its Applications, 1997, Data Mining and Knowledge Discovery.

[14] Philip S. Yu, et al. Fast algorithms for projected clustering, 1999, SIGMOD '99.

[15] David P. Woodruff, et al. Low rank approximation and regression in input sparsity time, 2013, STOC '13.

[16] Christos Boutsidis, et al. An improved approximation algorithm for the column subset selection problem, 2008, SODA.

[17] David P. Woodruff, et al. Coresets and sketches for high dimensional subspace approximation problems, 2010, SODA '10.

[18] Sariel Har-Peled. No, Coreset, No Cry, 2004, FSTTCS.

[19] Amos Fiat, et al. Coresets for Weighted Facilities and Their Applications, 2006, 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[20] W. B. Johnson, et al. Extensions of Lipschitz mappings into Hilbert space, 1984.

[21] Dan Feldman, et al. Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering, 2013, SODA.

[22] Michael Langberg, et al. A unified framework for approximating and clustering data, 2011, STOC.