k-Means for Streaming and Distributed Big Sparse Data

We provide the first streaming algorithm for computing a provable approximation to the $k$-means of sparse Big Data. Here, sparse Big Data is a set of $n$ vectors in $\mathbb{R}^d$, where each vector has $O(1)$ non-zero entries and $d\geq n$; examples include the adjacency matrix of a graph, and web-link, social-network, document-term, or image-feature matrices. Our streaming algorithm stores at most $\log n\cdot k^{O(1)}$ input points in memory. If the stream is distributed among $M$ machines, the running time is reduced by a factor of $M$, while a total of only $M\cdot k^{O(1)}$ (sparse) input points is communicated between the machines.

Our main technical result is a deterministic algorithm for computing a sparse $(k,\epsilon)$-coreset: a weighted subset of $k^{O(1)}$ input points that approximates the sum of squared distances from the $n$ input points to every set of $k$ centers, up to a $(1\pm\epsilon)$ factor, for any given constant $\epsilon>0$. This is the first such coreset whose size is independent of both $d$ and $n$. Existing algorithms use coresets of size at least polynomial in $d$, or project the input points onto a subspace, which destroys their sparsity and thus requires memory and communication of $\Omega(d)=\Omega(n)$ even for $k=2$. Experimental results on real public datasets show that our algorithm boosts the performance of existing heuristics even in the off-line setting. Open code is provided for reproducibility.
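To make the coreset guarantee concrete, the sketch below checks the $(k,\epsilon)$ property empirically on toy data: a weighted subset $S$ of the input $P$ should preserve the $k$-means cost, i.e., the sum of squared distances to the nearest of the $k$ centers, for any query set $C$ of $k$ centers. This is only a minimal illustration assuming NumPy; the uniform-sampling helper is a naive placeholder for the interface and is not the paper's deterministic sparse construction.

```python
# Minimal sketch of the (k, eps)-coreset property from the abstract:
# for every set C of k centers, |cost(P, C) - cost_w(S, C)| <= eps * cost(P, C),
# where cost is the sum of squared distances to the nearest center.
import numpy as np

def cost(points, centers, weights=None):
    """(Weighted) sum of squared distances from each point to its nearest center."""
    # Pairwise squared distances, shape (n, k); then take the nearest center per point.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.min(axis=1)
    if weights is None:
        weights = np.ones(len(points))
    return float((weights * nearest).sum())

def uniform_coreset(points, size, rng):
    """Placeholder only: uniform sample with inverse-probability weights.
    The paper's algorithm instead selects k^{O(1)} points deterministically."""
    idx = rng.choice(len(points), size=size, replace=False)
    weights = np.full(size, len(points) / size)  # each sample stands in for n/size points
    return points[idx], weights

rng = np.random.default_rng(0)
P = rng.standard_normal((10_000, 5))   # dense toy data; the paper targets sparse inputs with huge d
S, w = uniform_coreset(P, size=500, rng=rng)

C = rng.standard_normal((3, 5))        # an arbitrary query set of k = 3 centers
full, approx = cost(P, C), cost(S, C, w)
print(f"relative error: {abs(full - approx) / full:.4f}")
```

Note that uniform sampling only preserves the cost for "typical" center sets; the point of the paper's deterministic construction is that the $(1\pm\epsilon)$ bound holds simultaneously for every choice of $k$ centers, with a subset size independent of $n$ and $d$.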
