A unifying framework for ℓ0-sampling algorithms

The problem of building an ℓ0-sampler is to sample near-uniformly from the support set of a dynamic multiset. This problem has a variety of applications within data analysis, computational geometry and graph algorithms. In this paper, we abstract a set of steps for building an ℓ0-sampler, based on sampling, recovery and selection. We analyze the implementation of an ℓ0-sampler within this framework, and show how prior constructions of ℓ0-samplers can all be expressed in terms of these steps. Our experimental contribution is to provide a first detailed study of the accuracy and computational cost of ℓ0-samplers.
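The three steps named above (sampling, recovery, selection) can be illustrated with a minimal sketch. This is not the construction analyzed in the paper. It assumes a simple linear hash in place of a k-wise independent family, and it uses 1-sparse recovery at each level rather than general s-sparse recovery, so selection reduces to returning the single recovered item. The names `OneSparseRecovery`, `L0Sampler`, `levels` and `seed` are hypothetical, chosen for illustration only.

```python
import random

P = (1 << 61) - 1  # Mersenne prime; assumed field size for hashing and fingerprints


class OneSparseRecovery:
    """Recovers (index, count) when exactly one index has a nonzero count."""

    def __init__(self, rng):
        self.weight = 0        # sum of counts
        self.index_sum = 0     # sum of count * index
        self.fingerprint = 0   # sum of count * z^index mod P
        self.z = rng.randrange(1, P)

    def update(self, index, delta):
        self.weight += delta
        self.index_sum += delta * index
        self.fingerprint = (self.fingerprint + delta * pow(self.z, index, P)) % P

    def recover(self):
        if self.weight == 0 or self.index_sum % self.weight != 0:
            return None
        index = self.index_sum // self.weight
        if index < 0:
            return None
        # The fingerprint test rejects vectors that are not 1-sparse (with high probability).
        if (self.weight * pow(self.z, index, P)) % P != self.fingerprint:
            return None
        return index, self.weight


class L0Sampler:
    """Illustrative sampler: subsample by hash level, recover, then select."""

    def __init__(self, levels=32, seed=0):
        rng = random.Random(seed)
        self.a = rng.randrange(1, P)   # assumed linear hash h(i) = (a*i + b) mod P
        self.b = rng.randrange(P)
        self.sketches = [OneSparseRecovery(rng) for _ in range(levels)]

    def update(self, index, delta):
        h = (self.a * index + self.b) % P
        # Sampling: level j keeps an index with probability about 2^-j.
        for j, sketch in enumerate(self.sketches):
            if h < (P >> j):
                sketch.update(index, delta)

    def sample(self):
        # Recovery and selection: return the item from the first level that
        # decodes to exactly one surviving index, or None on failure.
        for sketch in self.sketches:
            result = sketch.recover()
            if result is not None:
                return result
        return None


sampler = L0Sampler()
sampler.update(5, 1)
sampler.update(5, 1)
sampler.update(9, 1)
sampler.update(9, -1)    # deleting 9 leaves support {5}
print(sampler.sample())  # (5, 2)
```

Because every update is linear, insertions and deletions cancel exactly, which is what lets the sketch track the support of a dynamic multiset. With only 1-sparse recovery the sampler can return `None` when no level isolates a single item; a construction using s-sparse recovery would fail less often at the cost of more space.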
