Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling

Emerging data stream management systems face massive data distributions that arrive at high speed while only limited storage is available. They cope by summarizing and mining the distributions using samples or sketches. However, data distributions can be "viewed" in different ways. A data stream of integer values can be viewed either as the forward distribution f(x), i.e., the number of occurrences of x in the stream, or as its inverse, f^{-1}(i), which is the number of items that appear i times. While the two views are equivalent in stored data systems, over data streams that entail approximations they may be significantly different. In other words, samples and sketches developed for the forward distribution may be ineffective for summarizing or mining the inverse distribution. Yet many applications, such as IP traffic monitoring, naturally rely on mining inverse distributions.

We formalize the problems of managing and mining inverse distributions and show provable differences between summarizing the forward distribution and summarizing the inverse distribution. We present methods for summarizing and mining inverse distributions of data streams. They rely on a novel technique that maintains a dynamic sample over the stream and can be used for a variety of summarization tasks (building quantiles or equidepth histograms) and mining tasks (anomaly detection: finding heavy hitters, and measuring the number of rare items), all with provable guarantees on the quality of the approximations and on the time and space used by our streaming methods.

We also complement our analytical and algorithmic results with an experimental study of the methods over network data streams.
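To make the two views concrete, the following minimal sketch computes both distributions exactly for a tiny stream. It is an illustration of the definitions only, not the dynamic inverse sampling technique of the paper: it stores every distinct item, which is exactly what a streaming method cannot afford. The function names and the example stream are assumptions chosen for illustration.

```python
from collections import Counter


def forward_distribution(stream):
    """Return f, where f[x] is the number of occurrences of x in the stream."""
    return Counter(stream)


def inverse_distribution(stream):
    """Return f_inv, where f_inv[i] is the number of distinct items
    that appear exactly i times in the stream."""
    f = forward_distribution(stream)
    # Count how many items share each occurrence count.
    return Counter(f.values())


# Hypothetical stream of integer values (illustrative data only).
stream = [3, 7, 3, 9, 7, 3, 5]

f = forward_distribution(stream)
f_inv = inverse_distribution(stream)

print(dict(f))      # occurrences per item
print(dict(f_inv))  # number of items per occurrence count
```

In this example, items 9 and 5 each occur once, item 7 occurs twice, and item 3 occurs three times, so the inverse distribution assigns 2 to i = 1 and 1 to each of i = 2 and i = 3. The number of rare items mentioned above corresponds to f^{-1}(1) here.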
