Paper information - Streaming Coreset Constructions for M-Estimators

Streaming Coreset Constructions for M-Estimators

We introduce a new method of maintaining a (k, ε)-coreset for clustering M-estimators over insertion-only streams. Let (P, w) be a weighted set (where w : P → [0, ∞) is the weight function) of points in a ρ-metric space (meaning a set X equipped with a positive-semidefinite symmetric function D such that D(x, z) ≤ ρ(D(x, y) + D(y, z)) for all x, y, z ∈ X). For any set of points C, we define COST(P, w, C) = ∑_{p ∈ P} w(p) min_{c ∈ C} D(p, c). A (k, ε)-coreset for (P, w) is a weighted set (Q, v) such that for every set C of k points, (1 − ε) COST(P, w, C) ≤ COST(Q, v, C) ≤ (1 + ε) COST(P, w, C). Essentially, the coreset (Q, v) can be used in place of (P, w) for all operations concerning the COST function. Coresets, as a method of data reduction, are used to solve fundamental problems in machine learning on streaming and distributed data. M-estimators are functions D(x, y) that can be written as ψ(d(x, y)), where (X, d) is a true metric (i.e. 1-metric) space. Special cases of M-estimators include the well-known k-median (ψ(x) = x) and k-means (ψ(x) = x²) functions. Our technique takes an existing offline construction for an M-estimator coreset and converts it into the streaming setting, where n data points arrive sequentially. To our knowledge, this is the first streaming construction for any M-estimator that does not rely on the merge-and-reduce tree. For example, our coreset for streaming metric k-means uses O(ε⁻² k log k log n) points of storage. The previous state-of-the-art required storing at least O(ε⁻² k log k log⁴ n) points.

2012 ACM Subject Classification: Theory of computation → Streaming models; Theory of computation → Facility location and clustering; Information systems → Query optimization
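The COST function and the (k, ε)-coreset condition from the abstract can be illustrated with a small sketch. This is not the paper's construction: the names `cost` and `violates_coreset` are hypothetical, the points are assumed to lie on the real line with d(x, y) = |x − y|, and the check runs only over a finite list of candidate centers, which is weaker than the definition's requirement that the bounds hold for every set C of k points.

```python
from itertools import combinations

def cost(points, weights, centers, psi=lambda r: r):
    """COST(P, w, C) = sum over p of w(p) * min over c of psi(|p - c|).

    Assumption: points are real numbers and d(x, y) = |x - y|.
    psi(r) = r gives k-median, psi(r) = r**2 gives k-means.
    """
    return sum(w * min(psi(abs(p - c)) for c in centers)
               for p, w in zip(points, weights))

def violates_coreset(P, w, Q, v, candidates, k, eps, psi=lambda r: r):
    """Return the first k-subset of `candidates` that breaks the
    (1 - eps) / (1 + eps) bounds, or None if no such subset is found.

    Only a necessary check: the definition quantifies over all sets C.
    """
    for C in combinations(candidates, k):
        full = cost(P, w, C, psi)
        small = cost(Q, v, C, psi)
        if not ((1 - eps) * full <= small <= (1 + eps) * full):
            return C
    return None

# Toy input: three locations, each appearing twice with unit weight.
P = [0.0, 0.0, 1.0, 1.0, 10.0, 10.0]
w = [1.0] * 6

# Merging duplicates into weight-2 points preserves COST exactly.
Q_good, v_good = [0.0, 1.0, 10.0], [2.0, 2.0, 2.0]
# Collapsing everything onto one point does not.
Q_bad, v_bad = [0.0], [6.0]

candidates = [0.0, 1.0, 5.0, 10.0]
k_median = lambda r: r
k_means = lambda r: r ** 2

print(violates_coreset(P, w, Q_good, v_good, candidates, 2, 0.1, k_median))
print(violates_coreset(P, w, Q_good, v_good, candidates, 2, 0.1, k_means))
print(violates_coreset(P, w, Q_bad, v_bad, candidates, 2, 0.1, k_median))
```

The first two calls find no violating center set, since the weighted set with merged duplicates has the same COST as the original for any C. The third call returns a center set on which the collapsed set underestimates the cost of the far-away points.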
