Probabilistic k-Median Clustering in Data Streams

This work introduces and constructs probabilistic coresets. A probabilistic coreset may itself contain probabilistic points, and the number of these points should be polylogarithmic in the input size. The overall storage size, however, also depends on the representation size of each point's probability distribution. Our first observation is therefore that the size of a probabilistic coreset must be bounded both in the number of points and in the representation size of each point. We propose the first (k, ε)-coreset constructions for the probabilistic k-median problem in the metric and Euclidean cases. The coresets have size poly(1/ε, k, log(W/(p_min⋅δ))), where W is the expected total weight of the weighted probabilistic input points when all weights are scaled to be at least one, p_min is the smallest probability with which a point is realized at a certain location, and δ is the error probability of the construction. Our coreset for the Euclidean problem can be maintained in data streams.
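
To make the objective concrete, the following is a minimal sketch of how a probabilistic k-median cost could be evaluated in the Euclidean case. It assumes a formulation in which each probabilistic point is assigned as a whole to the single center that minimizes its expected distance; the abstract does not spell out the objective, so this choice, the data layout, and the function names `expected_distance` and `assigned_k_median_cost` are illustrative assumptions rather than the construction from the paper.

```python
from math import dist


def expected_distance(locations, center):
    """Expected Euclidean distance from one probabilistic point to a center.

    `locations` is a list of (probability, coordinates) pairs describing
    where the point may be realized (hypothetical representation).
    """
    return sum(p * dist(x, center) for p, x in locations)


def assigned_k_median_cost(points, centers):
    """Weighted sum, over all probabilistic points, of the smallest
    expected distance to any center (assumed 'assigned' objective)."""
    return sum(
        w * min(expected_distance(locs, c) for c in centers)
        for w, locs in points
    )


# Each entry is (weight, [(probability, location), ...]).
points = [
    (1.0, [(0.5, (0.0, 0.0)), (0.5, (2.0, 0.0))]),
    (2.0, [(1.0, (10.0, 0.0))]),
]
centers = [(1.0, 0.0), (10.0, 0.0)]
cost = assigned_k_median_cost(points, centers)
```

Under this assumed objective, a (k, ε)-coreset would be a smaller weighted set of probabilistic points whose cost is within a factor of (1 ± ε) of the cost of the full input for every choice of k centers.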
