On the randomness that generates biased samples: The limited randomness approach

We introduce two new algorithms for maintaining an exponentially biased sample over a possibly infinite data stream. An algorithm for this task already exists in the literature and uses O(log n) random bits per stream element, where n is the number of elements in the sample. In this paper we present algorithms that use O(1) random bits per stream element. In essence, we are able to choose one element at random out of n while spending only O(1) random bits. This is not possible in general, but the specific structure of the problem we study makes it possible. The randomness needed for this task is supplied by a random walk. To prove the correctness of our algorithms we use a model that is also introduced in this paper, the limited randomness model. It relies on the fact that survival probabilities are assigned to the stream elements before they start to arrive.
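To make the O(log n) baseline concrete, here is a minimal Python sketch of a classical exponentially biased reservoir scheme. It assumes that the existing algorithm mentioned above is of this kind: each arriving element is always stored, and with probability equal to the current fill fraction it overwrites a uniformly chosen slot. The names `biased_reservoir`, `capacity`, and `rng` are illustrative and do not come from the paper. The sketch does not implement the paper's new random-walk algorithms.

```python
import random


def biased_reservoir(stream, capacity, rng=None):
    """Keep an exponentially biased sample of at most `capacity` items.

    Illustrative baseline only; not the O(1)-random-bit algorithms
    that the paper introduces.
    """
    rng = rng or random.Random()
    reservoir = []
    for item in stream:
        fill = len(reservoir) / capacity  # fraction of slots in use
        if rng.random() < fill:
            # Overwrite a uniformly chosen slot. Drawing this index is
            # the step that costs about log2(capacity) random bits.
            reservoir[rng.randrange(len(reservoir))] = item
        else:
            # The reservoir still has room: store without evicting.
            reservoir.append(item)
    return reservoir


sample = biased_reservoir(range(10_000), capacity=50, rng=random.Random(0))
```

Every element enters the reservoir on arrival and is later evicted with probability about 1/n per step once the reservoir is full, so older elements survive with exponentially decaying probability. The uniform slot choice in the replacement branch is where the O(log n) random bits per element are spent.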
