Approximate Frequency Counts over Data Streams

Publisher Summary: This chapter presents algorithms for computing the frequency counts of items whose frequency exceeds a user-specified threshold over a data stream. The algorithms are simple and have provably small memory footprints. Although the output is approximate, the error is guaranteed not to exceed a user-specified parameter. The algorithms can easily be deployed for streams of singleton items, such as those found in IP network monitoring.

In several emerging applications, data takes the form of continuous data streams rather than finite stored datasets. Examples include stock tickers, network traffic measurements, Web-server logs, click streams, data feeds from sensor networks, and telecom call records. Stream processing differs from computation over traditional stored datasets in two important respects: (a) the volume of a stream over its lifetime can be huge, and (b) queries require timely answers, so response times must be small. It is therefore not feasible to store the stream in its entirety on secondary storage and scan it when a query arrives.
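The summary does not spell out the algorithms themselves. As an illustration only, the following Python sketch implements Lossy Counting, a well-known deterministic one-pass algorithm with this kind of guarantee; it may differ in detail from the chapter's own presentation. The function names `lossy_count` and `frequent_items` and the parameter names `epsilon` (error bound) and `support` (frequency threshold) are assumptions made for this example.

```python
import math


def lossy_count(stream, epsilon):
    """One pass over `stream`, keeping approximate counts.

    Each stored count undercounts the true count by at most epsilon * n,
    where n is the number of items seen so far.
    """
    width = math.ceil(1 / epsilon)   # items per bucket
    entries = {}                     # item -> [count, max possible undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)          # id of the current bucket
        if item in entries:
            entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets,
            # so its true count could be up to bucket - 1 higher.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent yet.
            entries = {k: v for k, v in entries.items()
                       if v[0] + v[1] > bucket}
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Report items whose stored count is at least (support - epsilon) * n."""
    threshold = (support - epsilon) * n
    return {k: count for k, (count, _) in entries.items()
            if count >= threshold}


if __name__ == "__main__":
    # Hypothetical stream: two heavy items mixed with many one-off items.
    stream = ["a" if i % 3 == 0 else "b" if i % 7 == 0 else f"u{i}"
              for i in range(2000)]
    entries, n = lossy_count(stream, epsilon=0.005)
    print(frequent_items(entries, n, support=0.05, epsilon=0.005))
```

Under these assumptions, every item whose true frequency is at least `support * n` is reported, no item with true frequency below `(support - epsilon) * n` is reported, and each reported count is below the true count by at most `epsilon * n`. Choosing `epsilon` well below `support` (a tenth of it here) trades a larger summary for tighter answers.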
