Finding frequent items in sliding windows with multinomially-distributed item frequencies

In this paper, we present an algorithm for identifying frequently occurring items within a sliding window of the last N items seen over an infinite data stream, under the following constraints. (1) The relative frequencies of the item types may vary over the lifetime of the stream, provided that they vary slowly enough that, with high probability, any sliding window of N items could have been generated by a multinomial distribution; the full version of this paper (Golab et al., 2004) refers to this as the drifting distribution model. (2) The entire sliding window does not fit in the available memory; otherwise, we could simply count all the distinct item types and return those whose frequencies exceed some threshold. (3) The stream may arrive at a high rate, so only a constant amortized number of operations is allowed for processing each item.
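To make constraint (2) concrete, the following is a minimal sketch of the exact-counting baseline that becomes available only when the whole window fits in memory. It is not the algorithm presented in this paper. The class name `ExactWindowCounter`, its methods, and the interpretation of the threshold as a fraction of the current window length are assumptions made for illustration.

```python
from collections import Counter, deque


class ExactWindowCounter:
    """Exact baseline (illustrative): keeps the entire window, so memory is O(N)."""

    def __init__(self, window_size):
        if window_size <= 0:
            raise ValueError("window_size must be positive")
        self.window_size = window_size
        self.window = deque()    # the last N items, oldest on the left
        self.counts = Counter()  # item type -> occurrences in the window

    def add(self, item):
        """Process one arriving item with a constant number of operations."""
        self.window.append(item)
        self.counts[item] += 1
        if len(self.window) > self.window_size:
            expired = self.window.popleft()
            self.counts[expired] -= 1
            if self.counts[expired] == 0:
                del self.counts[expired]  # drop types no longer in the window

    def frequent_items(self, threshold):
        """Return item types whose relative frequency exceeds `threshold`."""
        n = len(self.window)
        return {item for item, c in self.counts.items() if c > threshold * n}


counter = ExactWindowCounter(window_size=4)
for item in "aabacbbb":
    counter.add(item)
print(counter.frequent_items(0.5))  # prints {'b'}
```

Each arrival costs a constant number of deque and dictionary operations, which is in the spirit of constraint (3), but the deque stores all N items, which is exactly what constraint (2) rules out.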
