A stratified sampling algorithm for landmark windows over data streams

In many applications, data does not take the form of traditional stored relations but instead arrives in continuous, rapid, time-varying data streams that are potentially unbounded in size. Focusing on the problem of sampling from landmark windows over data streams, this paper presents a new concept, the stratified sampling ratio function, and then introduces a multistage stratified sampling algorithm for the landmark window model. The algorithm maintains a dynamic candidate sample set. When deciding whether an arriving tuple enters the sample set and which tuples are deleted from it, the algorithm takes the arrival time of data items into account: more recently arrived tuples have a greater probability of entering and remaining in the sample set than older ones. Theoretical analysis and experiments show that the algorithm is effective and efficient for continuous data stream processing.
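
The abstract does not give the details of the multistage stratified algorithm or of the stratified sampling ratio function, so neither is reproduced here. To illustrate only the general idea of a sample set in which recent tuples are more likely to enter and remain, the sketch below uses a different, simpler technique: the basic biased reservoir scheme described by Aggarwal (VLDB 2006). The class name `BiasedReservoir`, its `offer` method, and the `capacity` and `seed` parameters are hypothetical names chosen for this example.

```python
import random


class BiasedReservoir:
    """Recency-biased sample of a stream (a stand-in, not the paper's algorithm).

    Every arriving item is admitted. With probability equal to the fraction
    of the reservoir already filled, it overwrites a uniformly chosen
    resident item; otherwise it is appended. Older items are therefore
    exposed to more eviction rounds and survive with lower probability.
    """

    def __init__(self, capacity, seed=None):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items = []
        self._rng = random.Random(seed)

    def offer(self, item):
        # Fraction of the reservoir currently in use, in [0, 1].
        fill = len(self.items) / self.capacity
        if self._rng.random() < fill:
            # Replace a random resident item; the sample size stays the same.
            victim = self._rng.randrange(len(self.items))
            self.items[victim] = item
        else:
            # Room remains and the coin flip failed: grow the sample.
            self.items.append(item)


if __name__ == "__main__":
    sample = BiasedReservoir(capacity=100, seed=0)
    for t in range(10_000):  # the item value doubles as its arrival time
        sample.offer(t)
    print(len(sample.items), min(sample.items), max(sample.items))
```

Because the newest item is always admitted and a full reservoir evicts a uniformly random resident on each arrival, the probability that an item is still present decays roughly exponentially with its age. A stratified scheme would additionally partition the landmark window into strata and control how much of the sample each stratum receives.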
