An Improved Data Stream Summary: The Count-Min Sketch and Its Applications

We introduce a new sublinear-space data structure, the Count-Min Sketch (CM sketch), for summarizing data streams. The sketch allows fundamental queries in data stream summarization, such as point, range, and inner product queries, to be answered approximately and very quickly. It can also be applied to several important data stream problems, such as finding quantiles and frequent items. The time and space bounds we show for solving these problems with the CM sketch significantly improve on those previously known, typically reducing the dependence on the accuracy parameter ε from a factor of 1/ε² to a factor of 1/ε.
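To make the point query concrete, here is a minimal Python sketch of a Count-Min structure. The abstract gives no construction details, so everything below is an assumption drawn from the commonly described form of the data structure: a table of width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, one pairwise-independent hash function per row, and an estimate taken as the minimum over the rows. The class name, the parameter names `eps` and `delta`, the choice of prime, and the restriction to non-negative integer items are illustrative, not taken from the paper.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch for non-negative integer items."""

    # A Mersenne prime larger than any item we expect to hash (assumption).
    PRIME = (1 << 61) - 1

    def __init__(self, eps, delta, seed=0):
        # Width controls the additive error; depth controls failure probability.
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1 / delta))
        rng = random.Random(seed)
        # One (a, b) pair per row for the hash h(x) = ((a*x + b) mod p) mod width.
        self.hashes = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(self.depth)
        ]
        self.table = [[0] * self.width for _ in range(self.depth)]
        self.total = 0  # running sum of all counts seen so far

    def _columns(self, item):
        # Column index of `item` in each row.
        return [((a * item + b) % self.PRIME) % self.width
                for a, b in self.hashes]

    def update(self, item, count=1):
        # Add `count` to one counter per row.
        self.total += count
        for row, col in zip(self.table, self._columns(item)):
            row[col] += count

    def query(self, item):
        # Point query: the smallest counter is the least polluted by collisions.
        return min(row[col]
                   for row, col in zip(self.table, self._columns(item)))


# Usage: count a small stream and query one item.
cms = CountMinSketch(eps=0.01, delta=0.01)
for x in [3, 7, 3, 3, 9]:
    cms.update(x)
estimate = cms.query(3)  # never below the true count when all updates are non-negative
```

With non-negative updates the estimate can only overcount, because colliding items add to a counter and never subtract from it. In the usual analysis the overcount is at most ε times the total count with probability at least 1 − δ, which is where the 1/ε space dependence comes from; that bound is stated here as background, not derived from the abstract.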
