Streaming Coreset Constructions for M-Estimators

We introduce a new method of maintaining a (k, ε)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set of points, where w : P → [0, ∞) is the weight function, in a ρ-metric space, meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X. For any set of points C, we define COST(P, w, C) = ∑_{p∈P} w(p) min_{c∈C} D(p, c). A (k, ε)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ε) COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ε) COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning over streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)), where (X, d) is a true metric (i.e., 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x²) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it to the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ε⁻² k log k log n) points of storage. The previous state of the art required storing at least O(ε⁻² k log k log⁴ n) points.

2012 ACM Subject Classification: Theory of computation → Streaming models; Theory of computation → Facility location and clustering; Information systems → Query optimization
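To make the definitions of COST and the (k, ε)-coreset inequality concrete, the following is a minimal Python sketch. It is not the construction from this paper: the names `euclidean`, `cost`, and `holds_on` are illustrative assumptions, the metric d is taken to be Euclidean, and the check only tests a finite list of candidate center sets, whereas the definition quantifies over every set C of k points.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, ...]


def euclidean(x: Point, y: Point) -> float:
    """Underlying true metric d (an assumption; any metric would do)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5


def cost(points: Sequence[Point], weights: Sequence[float],
         centers: Sequence[Point],
         psi: Callable[[float], float] = lambda r: r) -> float:
    """COST(P, w, C) = sum over p of w(p) * min over c of psi(d(p, c))."""
    return sum(wp * min(psi(euclidean(p, c)) for c in centers)
               for p, wp in zip(points, weights))


def holds_on(P, w, Q, v, center_sets, eps, psi=lambda r: r) -> bool:
    """Check the coreset inequality on the given center sets only.

    Passing this check is necessary, not sufficient, for (Q, v) to be
    a (k, eps)-coreset of (P, w).
    """
    for C in center_sets:
        full = cost(P, w, C, psi)
        small = cost(Q, v, C, psi)
        if not ((1 - eps) * full <= small <= (1 + eps) * full):
            return False
    return True


# Toy example: merging duplicate points and summing their weights
# preserves COST exactly, so the inequality holds even with eps = 0.
P = [(0.0,), (0.0,), (1.0,), (10.0,)]
w = [1.0, 1.0, 1.0, 1.0]
Q = [(0.0,), (1.0,), (10.0,)]
v = [2.0, 1.0, 1.0]

k_median = lambda r: r       # psi(x) = x
k_means = lambda r: r ** 2   # psi(x) = x^2
candidates = [[(0.0,), (10.0,)], [(1.0,), (5.0,)], [(3.0,), (4.0,)]]

print(holds_on(P, w, Q, v, candidates, 0.0, k_median))
print(holds_on(P, w, Q, v, candidates, 0.0, k_means))
```

Swapping `psi` is all that distinguishes k-median from k-means here, which mirrors how an M-estimator is defined as ψ applied to an underlying metric.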
