New Frameworks for Offline and Streaming Coreset Constructions

Let $P$ be a set (called points), let $Q$ be a set (called queries), and let $f:P\times Q\to [0,\infty)$ be a function (called the cost). For an error parameter $\epsilon>0$, a set $S\subseteq P$ with a \emph{weight function} $w:P \rightarrow [0,\infty)$ is an $\epsilon$-coreset if $\sum_{s\in S}w(s) f(s,q)$ approximates $\sum_{p\in P} f(p,q)$ up to a multiplicative factor of $1\pm\epsilon$ for every query $q\in Q$.

We construct coresets for the $k$-means clustering of $n$ input points, both in an arbitrary metric space and in $d$-dimensional Euclidean space. For Euclidean space, we present the first coreset whose size is simultaneously independent of both $d$ and $n$. In particular, this is the first coreset of size $o(n)$ for a stream of $n$ sparse points in a $d \ge n$ dimensional space (e.g., adjacency matrices of graphs). We also provide the first generalizations of such coresets that handle outliers. For arbitrary metric spaces, we improve the dependence on $k$ to $k \log k$ and present a matching lower bound.

For $M$-estimator clustering (whose special cases include the well-known $k$-median and $k$-means problems), we introduce a new technique for converting an offline coreset construction into a streaming one. Our method yields streaming coreset algorithms that store $O(S + k \log n)$ points, where $S$ is the size of the offline coreset. In comparison, the previous state of the art, the merge-and-reduce technique, required $O(S \log^{2a+1} n)$ points, where $a$ is the exponent of the offline construction's dependence on $\epsilon^{-1}$. For example, combining our offline and streaming results yields a streaming metric $k$-means coreset algorithm that stores $O(\epsilon^{-2} k \log k \log n)$ points; the previous state of the art required $O(\epsilon^{-4} k \log k \log^{6} n)$ points.
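As an illustration of the coreset definition above, the following is a minimal Python/NumPy sketch for the $k$-means cost $f(p,q)=\min_{c\in q}\lVert p-c\rVert^2$, where a query $q$ is a set of $k$ centers. It is not the paper's construction: the helper names (`kmeans_cost`, `uniform_coreset`) are invented here, and the naive uniform sampler merely shows how the $(1\pm\epsilon)$ property is checked; the constructions in the paper achieve such guarantees with far smaller sets.

```python
# A toy illustration (not the paper's construction) of the epsilon-coreset
# definition for the k-means cost f(p, q) = min_{c in q} ||p - c||^2.
import numpy as np

def kmeans_cost(P, centers, weights=None):
    """Sum of (weighted) squared distances from each point to its nearest center."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (|P|, k)
    nearest = d2.min(axis=1)
    return nearest.sum() if weights is None else (weights * nearest).sum()

def uniform_coreset(P, m, rng):
    """Naive baseline: sample m points uniformly and weight each by n/m, so the
    weighted cost is an unbiased estimate of the full cost. Real constructions
    use non-uniform (sensitivity-based) sampling to obtain (1 +/- eps)
    guarantees with far fewer points."""
    n = len(P)
    idx = rng.choice(n, size=m, replace=False)
    return P[idx], np.full(m, n / m)

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 5))
S, w = uniform_coreset(P, m=500, rng=rng)

# Empirically check the (1 +/- eps) property on a few random queries of k centers.
k = 3
for _ in range(3):
    q = rng.normal(size=(k, 5))
    full, approx = kmeans_cost(P, q), kmeans_cost(S, q, w)
    print(f"relative error: {abs(approx - full) / full:.3f}")
```

The streaming comparison in the abstract is against merge-and-reduce, i.e. the Bentley-Saxe static-to-dynamic transformation applied to coresets. The sketch below shows that standard skeleton only, again with a toy compressor standing in for an actual offline coreset construction; the class and function names are illustrative, not taken from the paper.

```python
# Sketch of the merge-and-reduce baseline: keep at most one coreset per level;
# whenever two coresets share a level, merge them and re-compress. Only O(log n)
# coresets are stored at any time, but the approximation error compounds across
# levels. `compress` stands in for any offline coreset construction.
import numpy as np

def compress(P, w, m, rng):
    # Toy compressor: sample m points with probability proportional to weight
    # (with replacement) and reweight so the total weight is preserved.
    probs = w / w.sum()
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    return P[idx], np.full(m, w.sum() / m)

class MergeAndReduce:
    def __init__(self, m, rng):
        self.m = m            # target coreset size per level
        self.rng = rng
        self.buffer = []      # raw points awaiting compression (level 0)
        self.levels = {}      # level -> (points, weights)

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.m:
            pts = np.array(self.buffer)
            self.buffer = []
            self._push(1, (pts, np.ones(len(pts))))

    def _push(self, level, coreset):
        # Carry upward while a coreset already occupies this level.
        while level in self.levels:
            other = self.levels.pop(level)
            merged_pts = np.vstack([coreset[0], other[0]])
            merged_w = np.concatenate([coreset[1], other[1]])
            coreset = compress(merged_pts, merged_w, self.m, self.rng)  # merge, then reduce
            level += 1
        self.levels[level] = coreset

    def coreset(self):
        # Union of all stored coresets plus the uncompressed buffer (unit weights).
        pts = [c[0] for c in self.levels.values()]
        ws = [c[1] for c in self.levels.values()]
        if self.buffer:
            pts.append(np.array(self.buffer))
            ws.append(np.ones(len(self.buffer)))
        return np.vstack(pts), np.concatenate(ws)

rng = np.random.default_rng(1)
mr = MergeAndReduce(m=256, rng=rng)
for p in rng.normal(size=(20_000, 5)):
    mr.insert(p)
S, w = mr.coreset()
print(len(S), round(w.sum()))  # ~O(m log n) stored points; total weight = points seen
```

Because the compression is reapplied along every merge path, the error parameter must be split across the $O(\log n)$ levels, which is the source of the $\log^{2a+1} n$ factor in the merge-and-reduce storage bound; the streaming framework described in the abstract avoids this blow-up and stores only $O(S + k\log n)$ points.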
