Summarizing and Mining Skewed Data Streams

Many applications generate massive data streams. Summarizing such massive data requires fast, small-space algorithms to support post-hoc queries and mining. An important observation is that such streams are rarely uniform, and real data sources typically exhibit significant skewness. These are well modeled by Zipf distributions, which are characterized by a parameter, $z$, that captures the amount of skew. We present a data stream summary that can answer point queries with $\epsilon$ accuracy and show that the space needed is only $O(\epsilon^{-\min\{1,1/z\}})$. This is the first $o(1/\epsilon)$ space algorithm for this problem, and we show it is essentially tight for skewed distributions. We show that the same data structure can also estimate the $L_2$ norm of the stream in $o(1/\epsilon^2)$ space for $z > 1/2$, another improvement over the existing $\Omega(1/\epsilon^2)$ methods. We support our theoretical results with an experimental study over a large variety of real and synthetic data. We show that significant skew is present in both textual and telecommunication data. Our methods give strong accuracy, significantly better than other methods, and behave exactly in line with their analytic bounds.
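To make the point-query setting concrete, the following is a minimal sketch rather than the authors' implementation. The abstract does not name its data structure, so a Count-Min sketch is assumed here as a stand-in summary. The names `CountMinSketch`, `zipf_stream`, and `space_term` are hypothetical, and `space_term` evaluates only the ε-dependent factor of the stated bound, ignoring constants and any dependence on failure probability.

```python
import random


class CountMinSketch:
    """Assumed stand-in summary: depth rows of width counters each."""

    def __init__(self, width, depth, seed=0):
        self.width = width
        self.depth = depth
        self.prime = (1 << 61) - 1  # Mersenne prime used as hash modulus
        rng = random.Random(seed)
        # One (a, b) pair per row for the hash h(x) = ((a*x + b) mod p) mod width.
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(depth)
        ]
        self.rows = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        a, b = self.hashes[row]
        return ((a * item + b) % self.prime) % self.width

    def update(self, item, count=1):
        # Add the count to one counter in every row.
        for r in range(self.depth):
            self.rows[r][self._bucket(r, item)] += count

    def estimate(self, item):
        # Point query: with non-negative updates every row overestimates,
        # so the minimum over rows is the tightest available estimate.
        return min(self.rows[r][self._bucket(r, item)] for r in range(self.depth))


def zipf_stream(n_items, z, length, seed=0):
    """Draw `length` items from {1..n_items} with P(i) proportional to 1 / i**z."""
    rng = random.Random(seed)
    weights = [1.0 / (i ** z) for i in range(1, n_items + 1)]
    return rng.choices(range(1, n_items + 1), weights=weights, k=length)


def space_term(eps, z):
    """The eps-dependent factor eps**(-min(1, 1/z)) of the stated space bound."""
    return eps ** -min(1.0, 1.0 / z)


# Small demonstration on a skewed synthetic stream (all parameters are illustrative).
stream = zipf_stream(n_items=1000, z=1.5, length=20000)
sketch = CountMinSketch(width=64, depth=4)
for x in stream:
    sketch.update(x)

true_count = stream.count(1)
print("item 1: true", true_count, "estimated", sketch.estimate(1))
print("space term at eps=0.01: z=0.5 ->", space_term(0.01, 0.5),
      "| z=2.0 ->", space_term(0.01, 2.0))
```

For $z \le 1$ the factor equals $1/\epsilon$, and for $z > 1$ it shrinks to $\epsilon^{-1/z}$, which is where the claimed saving for skewed streams comes from.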
