Estimating Rarity and Similarity over Data Stream Windows

In the windowed data stream model, we observe items arriving over time. At any time t, we consider the window of the last N observations a_{t-(N-1)}, a_{t-(N-2)}, ..., a_t, where each a_i ∈ {1, ..., u}, and we are required to support queries about the data in the window. A crucial restriction is that we are allowed only o(N) storage (often polylogarithmic in N), so not all items within the window can be archived.

We study two basic problems in the windowed data stream model. The first is estimating the rarity of items in the window. The second is estimating the similarity between two data stream windows using the Jaccard coefficient. Both problems have many applications in mining massive data sets. We present novel, simple algorithms for estimating rarity and similarity on windowed data streams that are accurate up to a factor of 1 ± ε and use space only logarithmic in the window size.
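To make the two quantities concrete, the following is a minimal sketch and not the paper's algorithm. It computes rarity and the Jaccard coefficient exactly on a fully stored window, which takes the O(N) space the model forbids, and then shows a standard min-hash estimator of the Jaccard coefficient for comparison. Several things here are assumptions made for illustration: the definition of rarity as the fraction of distinct items that occur exactly `alpha` times, every function and parameter name, and the use of salted BLAKE2b as a stand-in for a min-wise independent hash family. The sketch does not attempt the small-space maintenance of signatures as the window slides.

```python
import hashlib
import random
from collections import Counter


def jaccard(window_a, window_b):
    """Exact Jaccard coefficient |A ∩ B| / |A ∪ B| over distinct items."""
    a, b = set(window_a), set(window_b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty windows
    return len(a & b) / len(a | b)


def rarity(window, alpha=1):
    """Fraction of distinct items occurring exactly alpha times.

    This definition is an assumption for illustration; the abstract
    does not spell it out.
    """
    counts = Counter(window)
    if not counts:
        return 0.0
    return sum(1 for c in counts.values() if c == alpha) / len(counts)


def make_salts(k, seed=0):
    """k reproducible 8-byte salts, one per hypothetical hash function."""
    rng = random.Random(seed)
    return [rng.getrandbits(64).to_bytes(8, "little") for _ in range(k)]


def _hash(salt, x):
    # Salted BLAKE2b stands in for a min-wise independent hash function.
    digest = hashlib.blake2b(
        x.to_bytes(8, "little"), digest_size=8, salt=salt
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(window, salts):
    """Minimum hash value of the window's distinct items, per salt."""
    distinct = set(window)
    if not distinct:
        raise ValueError("window must be non-empty")
    return [min(_hash(s, x) for x in distinct) for s in salts]


def minhash_jaccard(sig_a, sig_b):
    """Estimate Jaccard as the fraction of positions where the minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Each signature has `k` entries regardless of window size, and the estimate's standard error shrinks roughly as 1/sqrt(k). Keeping such signatures correct as old items expire from the window is the harder part that the paper addresses and that this sketch leaves out.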
