New Frameworks for Offline and Streaming Coreset Constructions

Let $P$ be a set (called points), let $Q$ be a set (called queries), and let $f:P\times Q\to [0,\infty)$ be a function (called the cost). For an error parameter $\epsilon>0$, a set $S\subseteq P$ with a \emph{weight function} $w:P \rightarrow [0,\infty)$ is an $\epsilon$-coreset if $\sum_{s\in S}w(s) f(s,q)$ approximates $\sum_{p\in P} f(p,q)$ up to a multiplicative factor of $1\pm\epsilon$ for every query $q\in Q$.

We construct coresets for the $k$-means clustering of $n$ input points, both in an arbitrary metric space and in $d$-dimensional Euclidean space. For Euclidean space, we present the first coreset whose size is simultaneously independent of both $d$ and $n$. In particular, this is the first coreset of size $o(n)$ for a stream of $n$ sparse points in a $d \ge n$ dimensional space (e.g., adjacency matrices of graphs). We also provide the first generalizations of such coresets that handle outliers. For arbitrary metric spaces, we improve the dependence on $k$ to $k \log k$ and present a matching lower bound.

For $M$-estimator clustering (whose special cases include the well-known $k$-median and $k$-means problems), we introduce a new technique for converting an offline coreset construction into a streaming one. Our method yields streaming coreset algorithms that store $O(S + k \log n)$ points, where $S$ is the size of the offline coreset. In comparison, the previous state of the art, the merge-and-reduce technique, required $O(S \log^{2a+1} n)$ points, where $a$ is the exponent of the offline construction's dependence on $\epsilon^{-1}$. For example, combining our offline and streaming results yields a streaming metric $k$-means coreset algorithm that stores $O(\epsilon^{-2} k \log k \log n)$ points; the previous state of the art required $O(\epsilon^{-4} k \log k \log^{6} n)$ points.
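As an illustration of the coreset definition above, the following is a minimal Python/NumPy sketch for the $k$-means cost $f(p,q)=\min_{c\in q}\lVert p-c\rVert^2$, where a query $q$ is a set of $k$ centers. It is not the paper's construction: the helper names (`kmeans_cost`, `uniform_coreset`) are invented here, and the naive uniform sampler merely shows how the $(1\pm\epsilon)$ property is checked; the constructions in the paper achieve such guarantees with far smaller sets.

```python
# A toy illustration (not the paper's construction) of the epsilon-coreset
# definition for the k-means cost f(p, q) = min_{c in q} ||p - c||^2.
import numpy as np

def kmeans_cost(P, centers, weights=None):
    """Sum of (weighted) squared distances from each point to its nearest center."""
    d2 = ((P[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # shape (|P|, k)
    nearest = d2.min(axis=1)
    return nearest.sum() if weights is None else (weights * nearest).sum()

def uniform_coreset(P, m, rng):
    """Naive baseline: sample m points uniformly and weight each by n/m, so the
    weighted cost is an unbiased estimate of the full cost. Real constructions
    use non-uniform (sensitivity-based) sampling to obtain (1 +/- eps)
    guarantees with far fewer points."""
    n = len(P)
    idx = rng.choice(n, size=m, replace=False)
    return P[idx], np.full(m, n / m)

rng = np.random.default_rng(0)
P = rng.normal(size=(10_000, 5))
S, w = uniform_coreset(P, m=500, rng=rng)

# Empirically check the (1 +/- eps) property on a few random queries of k centers.
k = 3
for _ in range(3):
    q = rng.normal(size=(k, 5))
    full, approx = kmeans_cost(P, q), kmeans_cost(S, q, w)
    print(f"relative error: {abs(approx - full) / full:.3f}")
```

The streaming comparison in the abstract is against merge-and-reduce, i.e. the Bentley-Saxe static-to-dynamic transformation applied to coresets. The sketch below shows that standard skeleton only, again with a toy compressor standing in for an actual offline coreset construction; the class and function names are illustrative, not taken from the paper.

```python
# Sketch of the merge-and-reduce baseline: keep at most one coreset per level;
# whenever two coresets share a level, merge them and re-compress. Only O(log n)
# coresets are stored at any time, but the approximation error compounds across
# levels. `compress` stands in for any offline coreset construction.
import numpy as np

def compress(P, w, m, rng):
    # Toy compressor: sample m points with probability proportional to weight
    # (with replacement) and reweight so the total weight is preserved.
    probs = w / w.sum()
    idx = rng.choice(len(P), size=m, replace=True, p=probs)
    return P[idx], np.full(m, w.sum() / m)

class MergeAndReduce:
    def __init__(self, m, rng):
        self.m = m            # target coreset size per level
        self.rng = rng
        self.buffer = []      # raw points awaiting compression (level 0)
        self.levels = {}      # level -> (points, weights)

    def insert(self, p):
        self.buffer.append(p)
        if len(self.buffer) == self.m:
            pts = np.array(self.buffer)
            self.buffer = []
            self._push(1, (pts, np.ones(len(pts))))

    def _push(self, level, coreset):
        # Carry upward while a coreset already occupies this level.
        while level in self.levels:
            other = self.levels.pop(level)
            merged_pts = np.vstack([coreset[0], other[0]])
            merged_w = np.concatenate([coreset[1], other[1]])
            coreset = compress(merged_pts, merged_w, self.m, self.rng)  # merge, then reduce
            level += 1
        self.levels[level] = coreset

    def coreset(self):
        # Union of all stored coresets plus the uncompressed buffer (unit weights).
        pts = [c[0] for c in self.levels.values()]
        ws = [c[1] for c in self.levels.values()]
        if self.buffer:
            pts.append(np.array(self.buffer))
            ws.append(np.ones(len(self.buffer)))
        return np.vstack(pts), np.concatenate(ws)

rng = np.random.default_rng(1)
mr = MergeAndReduce(m=256, rng=rng)
for p in rng.normal(size=(20_000, 5)):
    mr.insert(p)
S, w = mr.coreset()
print(len(S), round(w.sum()))  # ~O(m log n) stored points; total weight = points seen
```

Because the compression is reapplied along every merge path, the error parameter must be split across the $O(\log n)$ levels, which is the source of the $\log^{2a+1} n$ factor in the merge-and-reduce storage bound; the streaming framework described in the abstract avoids this blow-up and stores only $O(S + k\log n)$ points.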
