Optimal sampling from sliding windows

A sliding windows model is an important case of the streaming model, where only the most "recent" elements remain active and the rest are discarded in a stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani and Widom (PODS 02); and Datar, Gionis, Indyk and Motwani (SODA 02)). There are two equally important types of the sliding windows model -- windows with fixed size, (e.g., where items arrive one at a time, and only the most recent n items remain active for some fixed parameter n), and bursty windows (e.g., where many items can arrive in "bursts" at a single step and where only items from the last t steps remain active, again for some fixed parameter t). Random sampling is a fundamental tool for data streams, as numerous algorithms operate on the sampled data instead of on the entire stream. Effective sampling from sliding windows is a nontrivial problem, as elements eventually expire. In fact, the deletions are implicit; i.e., it is not possible to identify deleted elements without storing the entire window. The implicit nature of deletions on sliding windows does not allow the existing methods (even those that support explicit deletions, e.g., Cormode, Muthukrishnan and Rozenbaum (VLDB 05); Frahling, Indyk and Sohler (SOCG 05)) to be directly "translated" to the sliding windows model. One trivial approach to overcoming the problem of implicit deletions is that of over-sampling. When k samples are required, the over-sampling method maintains k'>k samples in the hope that at least k samples are not expired. The obvious disadvantages of this method are twofold: (a) It introduces additional costs and thus decreases the performance; and (b) The memory bounds are not deterministic, which is atypical for streaming algorithms (where even small probability events may eventually happen for a stream that is big enough). Babcock, Datar and Motwani (SODA 02), were the first to stress the importance of improvements to over-sampling. They formally introduced the problem of sampling from sliding windows and improved the over-sampling method for sampling with replacement. Their elegant solutions for sampling with replacement are optimal in expectation, and thus resolve disadvantage (a) mentioned above. Unfortunately, the randomized bounds do not resolve disadvantage (b) above. Interestingly, all algorithms that employ the ideas of Babcock, Datar and Motwani have the same central problem of having to deal with randomized complexity (see e.g., Datar and Muthukrishnan (ESA 02); Chakrabarti, Cormode and McGregor (SODA 07)). Further, the proposed solutions of Babcock, Datar and Motwani for sampling without replacement are based on the criticized over-sampling method and thus do not solve problem (a). Therefore, the question of whether we can solve sampling on sliding windows optimally (i.e., resolving both disadvantages) is implicit in the paper of Babcock, Datar and Motwani and has remained open for all variants of the problem. In this paper we answer these questions affirmatively and provide optimal sampling schemas for all variants of the problem, i.e., sampling with or without replacement from fixed or bursty windows. Specifically, for fixed-size windows, we provide optimal solutions that require O(k) memory; for bursty windows, we show algorithms that require O(klogn), which is optimal since it matches the lower bound by Gemulla and Lehner (SIGMOD 08). In contrast to the work of of Babcock, Datar and Motwani, our solutions have deterministic bounds. Thus, we prove a perhaps somewhat surprising fact: the memory complexity of the sampling-based algorithm for all variants of the sliding windows model is comparable with that of streaming models (i.e., without the sliding windows). This is the first result of this type, since all previous "translations" of sampling-based algorithms to sliding windows incur randomized memory guarantees only.

[1]  Srikanta Tirthapura,et al.  Distributed Streams Algorithms for Sliding Windows , 2002, SPAA '02.

[2]  Mohammad Ghodsi,et al.  New Streaming Algorithms for Counting Triangles in Graphs , 2005, COCOON.

[3]  Sumit Ganguly,et al.  Simpler algorithm for estimating frequency moments of data streams , 2006, SODA '06.

[4]  Subhash Khot,et al.  Near-optimal lower bounds on the multi-party communication complexity of set disjointness , 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings..

[5]  Piotr Indyk,et al.  Maintaining stream statistics over sliding windows: (extended abstract) , 2002, SODA '02.

[6]  Christian Sohler,et al.  Counting triangles in data streams , 2006, PODS.

[7]  Sudipto Guha,et al.  Streaming and sublinear approximation of entropy and information distances , 2005, SODA '06.

[8]  Graham Cormode,et al.  Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling , 2005, VLDB.

[9]  Edith Cohen,et al.  Size-Estimation Framework with Applications to Transitive Closure and Reachability , 1997, J. Comput. Syst. Sci..

[10]  Edith Cohen,et al.  Summarizing data using bottom-k sketches , 2007, PODC '07.

[11]  Graham Cormode,et al.  A near-optimal algorithm for computing the entropy of a stream , 2007, SODA '07.

[12]  Noga Alon,et al.  Estimating arbitrary subset sums with few probes , 2005, PODS '05.

[13]  Lap-Kei Lee,et al.  A simpler and more efficient deterministic scheme for finding frequent items over sliding windows , 2006, PODS '06.

[14]  Ravi Kumar,et al.  Sampling algorithms: lower bounds and applications , 2001, STOC '01.

[15]  Lap-Kei Lee,et al.  Maintaining significant stream statistics over sliding windows , 2006, SODA '06.

[16]  Wolfgang Lehner,et al.  Sampling time-based sliding windows in bounded space , 2008, SIGMOD Conference.

[17]  David Maier,et al.  Semantics and evaluation techniques for window aggregates in data streams , 2005, SIGMOD '05.

[18]  Joan Feigenbaum,et al.  Graph distances in the streaming model: the value of space , 2005, SODA '05.

[19]  Joan Feigenbaum,et al.  Computing Diameter in the Streaming and Sliding-Window Models , 2002, Algorithmica.

[20]  Ashwin Lall,et al.  A data streaming algorithm for estimating entropies of od flows , 2007, IMC '07.

[21]  S. Muthukrishnan,et al.  Estimating Rarity and Similarity over Data Stream Windows , 2002, ESA.

[22]  Rajeev Motwani,et al.  Maintaining variance and k-medians over data stream windows , 2003, PODS.

[23]  Charu C. Aggarwal,et al.  On biased reservoir sampling in the presence of stream evolution , 2006, VLDB.

[24]  Philip S. Yu,et al.  Moment: maintaining closed frequent itemsets over a stream sliding window , 2004, Fourth IEEE International Conference on Data Mining (ICDM'04).

[25]  S. Muthukrishnan,et al.  Estimating Entropy and Entropy Norm on Data Streams , 2006, Internet Math..

[26]  Vern Paxson,et al.  Framework for IP Performance Metrics , 1998, RFC.

[27]  Gurmeet Singh Manku,et al.  Approximate counts and quantiles over sliding windows , 2004, PODS.

[28]  Jennifer Widom,et al.  Models and issues in data stream systems , 2002, PODS.

[29]  Kim-Hung Li,et al.  Reservoir-sampling algorithms of time complexity O(n(1 + log(N/n))) , 1994, TOMS.

[30]  Sudipto Guha,et al.  Clustering Data Streams: Theory and Practice , 2003, IEEE Trans. Knowl. Data Eng..

[31]  Piotr Indyk,et al.  Sampling in dynamic data streams and applications , 2005, Int. J. Comput. Geom. Appl..

[32]  Rajeev Motwani,et al.  Sampling from a moving window over streaming data , 2002, SODA '02.

[33]  Erik D. Demaine,et al.  Identifying frequent items in sliding windows over on-line packet streams , 2003, IMC '03.

[34]  Charu C. Aggarwal,et al.  Data Streams - Models and Algorithms , 2014, Advances in Database Systems.

[35]  David Maier,et al.  No pane, no gain: efficient evaluation of sliding-window aggregates over data streams , 2005, SGMD.

[36]  Rajeev Motwani,et al.  Load shedding for aggregation queries over data streams , 2004, Proceedings. 20th International Conference on Data Engineering.

[37]  Rainer Gemulla,et al.  Sampling algorithms for evolving datasets , 2008 .

[38]  Sumit Ganguly,et al.  Estimating Frequency Moments of Data Streams Using Random Linear Combinations , 2004, APPROX-RANDOM.

[39]  Mahesh Viswanathan,et al.  Testing and Spot-Checking of Data Streams , 2000, SODA '00.

[40]  R. Ostrovsky,et al.  Smooth Histograms for Sliding Windows , 2007, FOCS 2007.

[41]  Christos Faloutsos,et al.  Density biased sampling: an improved method for data mining and clustering , 2000, SIGMOD '00.

[42]  Lukasz Golab,et al.  Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams , 2003, VLDB.

[43]  Carsten Lund,et al.  Flow sampling under hard resource constraints , 2004, SIGMETRICS '04/Performance '04.

[44]  Luca Trevisan,et al.  Counting Distinct Elements in a Data Stream , 2002, RANDOM.

[45]  Cecilia M. Procopiuc,et al.  Density Estimation for Spatial Data Streams , 2005, SSTD.

[46]  Jeffrey Scott Vitter,et al.  Random sampling with a reservoir , 1985, TOMS.

[47]  Anirban Dasgupta,et al.  Sampling algorithms and coresets for ℓp regression , 2007, SODA '08.

[48]  Ravi Kannan,et al.  The space complexity of pass-efficient algorithms for clustering , 2006, SODA '06.

[49]  Charu C. Aggarwal,et al.  Data Streams: Models and Algorithms (Advances in Database Systems) , 2006 .

[50]  Ravi Kumar,et al.  An information statistics approach to data stream and communication complexity , 2004, J. Comput. Syst. Sci..

[51]  David P. Woodruff,et al.  Optimal approximations of the frequency moments of data streams , 2005, STOC '05.

[52]  Rajeev Motwani,et al.  On random sampling over joins , 1999, SIGMOD '99.

[53]  Joan Feigenbaum,et al.  On graph problems in a semi-streaming model , 2005, Theor. Comput. Sci..

[54]  Abhinandan Das,et al.  Semantic approximation of data stream joins , 2005, IEEE Transactions on Knowledge and Data Engineering.

[55]  Yong Wang,et al.  RANDOM SAMPLING ALGORITHMS FOR SLIDING WINDOWS OVER DATA STREAMS , 2005 .

[56]  Data Streams : Algorithms and Applications 1 , 2022 .

[57]  Jeffrey F. Naughton,et al.  Static optimization of conjunctive queries with sliding windows over infinite streams , 2004, SIGMOD '04.

[58]  Krzysztof Onak,et al.  Sketching and Streaming Entropy via Approximation Theory , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[59]  Yossi Matias,et al.  New sampling-based summary statistics for improving approximate query answers , 1998, SIGMOD '98.

[60]  Stanley B. Zdonik,et al.  Window-aware load shedding for aggregation queries over data streams , 2006, VLDB.

[61]  Kamalika Chaudhuri,et al.  When Random Sampling Preserves Privacy , 2006, CRYPTO.

[62]  Sumit Ganguly,et al.  Counting distinct items over update streams , 2005, Theor. Comput. Sci..

[63]  Noga Alon,et al.  The Space Complexity of Approximating the Frequency Moments , 1999 .

[64]  Mario Szegedy,et al.  The DLT priority sampling is essentially optimal , 2006, STOC '06.

[65]  Ravi Kumar,et al.  An improved data stream algorithm for frequency moments , 2004, SODA '04.

[66]  Rajeev Motwani,et al.  Incremental Clustering and Dynamic Information Retrieval , 2004, SIAM J. Comput..

[67]  Subhash Suri,et al.  Catching elephants with mice: Sparse sampling for monitoring sensor networks , 2009, TOSN.

[68]  Ziv Bar-Yossef,et al.  Reductions in streaming algorithms, with an application to counting triangles in graphs , 2002, SODA '02.

[69]  Michael Kolonko,et al.  Sequential reservoir sampling with a nonuniform distribution , 2006, TOMS.

[70]  Rajeev Motwani,et al.  Operator scheduling in data stream systems , 2004, VLDB 2004.

[71]  Ziv Bar-Yossef,et al.  Sampling lower bounds via information theory , 2003, STOC '03.