Coresets in dynamic geometric data streams

A dynamic geometric data stream consists of a sequence of <i>m</i> insert/delete operations on points from the discrete space {1,…,Δ}<sup><i>d</i></sup> [26]. We develop streaming (1 + ε)-approximation algorithms for <i>k</i>-median, <i>k</i>-means, MaxCut, maximum weighted matching (MaxWM), maximum travelling salesperson (MaxTSP), maximum spanning tree (MaxST), and average distance over dynamic geometric data streams. Our algorithms maintain a small weighted set of points (a coreset) that, with probability 2/3, approximates the current point set with respect to the considered problem during the <i>m</i> insert/delete operations of the data stream. They use poly(ε<sup>-1</sup>, log <i>m</i>, log Δ) space and update time per insert/delete operation for constant <i>k</i> and dimension <i>d</i>. Having a coreset, one only needs a fast approximation algorithm for the weighted problem to compute a solution quickly. In fact, even an exponential algorithm is sometimes feasible, as its running time may still be polynomial in <i>n</i>. For example, one can compute a solution to <i>k</i>-median and <i>k</i>-means in poly(log <i>n</i>, exp(<i>O</i>((1+log (1⁄ε)⁄ε)<sup><i>d</i>-1</sup>))) time [21], where <i>n</i> is the size of the current point set and <i>k</i> and <i>d</i> are constants. Finding an implicit solution to MaxCut can be done in poly(log <i>n</i>, exp((1⁄ε)<sup><i>O</i>(1)</sup>)) time. For MaxST and average distance we require poly(log <i>n</i>, ε<sup>-1</sup>) time, and for MaxWM we require <i>O</i>(<i>n</i><sup>3</sup>) time.
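To illustrate why an expensive algorithm becomes affordable once a coreset is available, the following is a minimal sketch that solves weighted <i>k</i>-median by exhaustive search over a tiny weighted point set. It is not the algorithm of the paper or of [21]: the function names are hypothetical, the candidate centers are restricted to the coreset points themselves (a simplifying assumption), and the coreset is given by hand rather than maintained over a stream.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def weighted_kmedian_cost(coreset, centers):
    """Sum of weight * distance-to-nearest-center over all weighted points."""
    return sum(w * min(dist(p, c) for c in centers) for p, w in coreset)


def brute_force_kmedian(coreset, k):
    """Try every k-subset of coreset points as centers (exponential in k).

    Assumption for illustration only: centers are chosen among the
    coreset points, which is not required by the problem definition.
    """
    points = [p for p, _ in coreset]
    best = min(combinations(points, k),
               key=lambda cs: weighted_kmedian_cost(coreset, cs))
    return best, weighted_kmedian_cost(coreset, best)


# A hand-made weighted point set standing in for a coreset:
# each entry is (point, weight).
coreset = [((0, 0), 3), ((1, 0), 1), ((10, 10), 2), ((11, 10), 1)]
centers, cost = brute_force_kmedian(coreset, k=2)
print(centers, cost)
```

Because the search runs over the coreset rather than the full point set, its cost depends on the coreset size, which is what keeps such exhaustive approaches within reach.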

[1] Joan Feigenbaum, et al. Computing Diameter in the Streaming and Sliding-Window Models, 2002, Algorithmica.

[2] Timothy M. Chan, et al. Geometric Optimization Problems over Sliding Windows, 2004, Int. J. Comput. Geom. Appl.

[3] Graham Cormode, et al. An improved data stream summary: the count-min sketch and its applications, 2004, J. Algorithms.

[4] Aravind Srinivasan, et al. Chernoff-Hoeffding bounds for applications with limited independence, 1995, SODA '93.

[5] Csaba D. Tóth, et al. Range Counting over Multidimensional Data Streams, 2004, SCG '04.

[6] Piotr Indyk, et al. Sampling in dynamic data streams and applications, 2005, Int. J. Comput. Geom. Appl.

[7] Rina Panigrahy, et al. Better streaming algorithms for clustering problems, 2003, STOC '03.

[8] David Eppstein, et al. Deterministic sampling and range counting in geometric data streams, 2003, TALG.

[9] Piotr Indyk, et al. Better algorithms for high-dimensional proximity problems via asymmetric embeddings, 2003, SODA '03.

[10] Dana Ron, et al. Property testing and its connection to learning and approximation, 1998, JACM.

[11] Rajeev Rastogi, et al. Processing set expressions over continuous update streams, 2003, SIGMOD '03.

[12] Noga Alon, et al. The Space Complexity of Approximating the Frequency Moments, 1999.

[13] Graham Cormode, et al. Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling, 2005, VLDB.

[14] Graham Cormode, et al. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications, 2004, LATIN.

[15] Adam Meyerson, et al. Online facility location, 2001, Proceedings 2001 IEEE International Conference on Cluster Computing.

[16] Timothy M. Chan. Faster core-set constructions and data-stream algorithms in fixed dimensions, 2006, Comput. Geom.

[17] Sariel Har-Peled, et al. Smaller Coresets for k-Median and k-Means Clustering, 2005, SCG.

[18] Rajeev Motwani, et al. Incremental Clustering and Dynamic Information Retrieval, 2004, SIAM J. Comput.

[19] R. Prim. Shortest connection networks and some generalizations, 1957.

[20] Moses Charikar, et al. Finding frequent items in data streams, 2004, Theor. Comput. Sci.

[21] Piotr Indyk, et al. Algorithms for dynamic geometric problems over data streams, 2004, STOC '04.

[22] Harold N. Gabow, et al. Data structures for weighted matching and nearest common ancestors with linking, 1990, SODA '90.

[23] S. Muthukrishnan, et al. Data streams: algorithms and applications, 2005, SODA '03.

[24] Subhash Suri, et al. Adaptive sampling for geometric problems over data streams, 2004, PODS.

[25] Claire Mathieu, et al. A Randomized Approximation Scheme for Metric MAX-CUT, 1998, FOCS.

[26] Sariel Har-Peled, et al. On coresets for k-means and k-median clustering, 2004, STOC '04.

[27] Philippe Flajolet, et al. Probabilistic Counting Algorithms for Data Base Applications, 1985, J. Comput. Syst. Sci.

[28] Sudipto Guha, et al. Clustering Data Streams, 2000, FOCS.

[29] Timothy M. Chan, et al. Geometric Optimization Problems over Sliding Windows, 2006, Int. J. Comput. Geom. Appl.

[30] R. Motwani, et al. High-Dimensional Computational Geometry, 2000.

[31] Luca Trevisan, et al. Counting Distinct Elements in a Data Stream, 2002, RANDOM.