Sampling streaming data with replacement

Simple random sampling is a widely accepted basis for estimation from a population. When data arrive as a stream, the population size grows continuously and only one pass through the data is possible. Reservoir sampling maintains a fixed-size random sample from streaming data. Reservoir sampling without replacement has been studied extensively, and several algorithms with sub-linear time complexity exist. Reservoir sampling with replacement has been mentioned by some authors, but it has been studied very little and only linear-time algorithms exist. A with-replacement reservoir sampling algorithm of sub-linear time complexity is introduced, and a thorough complexity analysis of several approaches to the with-replacement reservoir sampling problem is provided.
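To make the problem concrete, the following is a minimal sketch of the straightforward linear-time approach to with-replacement reservoir sampling. It is not the sub-linear algorithm that the abstract announces. The function name `reservoir_with_replacement` and its parameters are illustrative assumptions. Each of the n reservoir slots is treated as an independent size-one reservoir: when the t-th item arrives, each slot is overwritten by it with probability 1/t, so after any number of items every slot holds a uniform draw from the stream seen so far, and the slots are independent of one another.

```python
import random


def reservoir_with_replacement(stream, n, rng=None):
    """Keep a with-replacement sample of size n from a one-pass stream.

    Naive linear-time sketch: every arriving item is tested against all
    n slots, so the cost is proportional to n times the stream length.
    """
    rng = rng or random.Random()
    sample = [None] * n  # slots stay None only if the stream is empty
    for t, item in enumerate(stream, start=1):
        for i in range(n):
            # Slot i takes the t-th item with probability 1/t.
            # At t == 1 this always succeeds, since random() < 1.0.
            if rng.random() < 1.0 / t:
                sample[i] = item
    return sample


if __name__ == "__main__":
    # Duplicates may appear because sampling is with replacement.
    print(reservoir_with_replacement(range(1000), 5, random.Random(42)))
```

Because every item touches every slot, the work done grows linearly with the stream length, which is the cost that a sub-linear algorithm would have to avoid.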
