On Unifying the Space of ℓ0-Sampling Algorithms

The problem of building an ℓ0-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an ℓ0-sampler, based on sampling, recovery and selection. We analyze the implementation of an ℓ0-sampler within this framework, and show how prior constructions of ℓ0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of ℓ0-samplers.
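
As an illustration of the three steps named above (sampling, recovery, selection), the following is a minimal Python sketch of one possible ℓ0-sampler. It is not the construction analyzed in the paper: it assumes a pairwise-independent linear hash for subsampling, a 1-sparse recovery cell with a polynomial fingerprint at each level, and a "first level that recovers" selection rule. The class name `L0Sampler`, the parameter `levels`, and the prime `P` are hypothetical choices made only for this example.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used for hashing and fingerprints (assumed)


class L0Sampler:
    """Toy l0-sampler: one 1-sparse recovery cell per sampling level."""

    def __init__(self, levels=32, seed=None):
        rng = random.Random(seed)
        self.levels = levels
        self.a = rng.randrange(1, P)  # hash h(i) = (a*i + b) mod P
        self.b = rng.randrange(P)
        self.r = rng.randrange(2, P)  # fingerprint base
        # Per level: [sum of counts, sum of count*index, fingerprint]
        self.cells = [[0, 0, 0] for _ in range(levels)]

    def _depth(self, i):
        # Sampling: item i survives to level j if h(i) < P / 2^j.
        h = (self.a * i + self.b) % P
        j = 0
        while j + 1 < self.levels and h < (P >> (j + 1)):
            j += 1
        return j

    def update(self, i, delta):
        """Add delta (possibly negative) to the count of item i >= 0."""
        f = pow(self.r, i, P)
        for j in range(self._depth(i) + 1):
            c = self.cells[j]
            c[0] += delta
            c[1] += delta * i
            c[2] = (c[2] + delta * f) % P

    def _recover(self, cell):
        # Recovery: succeeds when the level holds exactly one distinct item;
        # the fingerprint rejects other cases with high probability.
        total, weighted, fp = cell
        if total == 0 or weighted % total != 0:
            return None
        i = weighted // total
        if i < 0 or fp != (total * pow(self.r, i, P)) % P:
            return None
        return i, total

    def sample(self):
        # Selection: report the item from the first level that recovers.
        for cell in self.cells:
            out = self._recover(cell)
            if out is not None:
                return out
        return None  # the sampler failed (or the multiset is empty)
```

For example, after `s = L0Sampler(seed=1)`, inserting items 5, 9 and 12 and then deleting 5 and 12 with negative updates leaves a single item in the support, and `s.sample()` returns that item with its count. With several items present, the sketch may return `None`; a practical sampler would reduce this failure probability, for instance by using s-sparse recovery or independent repetitions.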
