Finding frequent items in data streams

We present a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space. Our method relies on a data structure called a COUNT SKETCH, which allows us to reliably estimate the frequencies of frequent items in the stream. Our algorithm achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies. In addition, our algorithm leads directly to a 2-pass algorithm for the problem of estimating the items with the largest (absolute) change in frequency between two data streams. To our knowledge, this latter problem has not been previously studied in the literature. c © 2003 Elsevier B.V. All rights reserved.

[1]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[2]  Rajeev Motwani,et al.  Computing Iceberg Queries Efficiently , 1998, VLDB.

[3]  M. Crovella,et al.  Heavy-tailed probability distributions in the World Wide Web , 1998 .

[4]  H. Garcia-Molina,et al.  Computing Iceberg Queries E ciently , 1998 .

[5]  Prabhakar Raghavan,et al.  Computing on data streams , 1999, External Memory Algorithms.

[6]  Yossi Matias,et al.  New sampling-based summary statistics for improving approximate query answers , 1998, SIGMOD '98.

[7]  Yossi Matias,et al.  DIMACS Series in Discrete Mathematicsand Theoretical Computer Science Synopsis Data Structures for Massive Data , 2007 .

[8]  Sudipto Guha,et al.  Clustering Data Streams , 2000, FOCS.

[9]  Mahesh Viswanathan,et al.  Testing and Spot-Checking of Data Streams , 2000, SODA '00.

[10]  Dimitris Achlioptas,et al.  Database-friendly random projections , 2001, PODS.

[11]  Jessica H. Fong,et al.  An Approximate Lp Difference Algorithm for Massive Data Streams , 1999, Discret. Math. Theor. Comput. Sci..

[12]  Mahesh Viswanathan,et al.  An Approximate L1-Difference Algorithm for Massive Data Streams , 2002, SIAM J. Comput..

[13]  Sudipto Guha,et al.  Fast, small-space algorithms for approximate histogram maintenance , 2002, STOC '02.

[14]  Yinglian Xie,et al.  Locality in search engine queries and its implications for caching , 2002, Proceedings.Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies.

[15]  Michael E. Saks,et al.  Space lower bounds for distance approximation in the data stream model , 2002, STOC '02.

[16]  Richard M. Karp,et al.  A simple algorithm for finding frequent elements in streams and bags , 2003, TODS.