Counting distinct items over update streams

In data streaming applications, data arrives rapidly and in high volume, so each stream update must be processed efficiently in both time and space. A data stream is a sequence of data records that must be processed continuously, in an online fashion, using sub-linear space and sub-linear processing time. We consider the problem of tracking the number of distinct items over update streams, that is, data streams that allow both insertion and deletion operations. We present two algorithms that improve on the space and time complexity of existing algorithms for this problem.
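
To make the problem statement concrete, the following is a minimal sketch of an exact baseline for counting distinct items under insertions and deletions. It is not one of the two algorithms presented here: it keeps one counter per live item, so its space is linear in the number of distinct items, which is precisely the cost a streaming algorithm aims to avoid. The class name `ExactDistinctCounter`, the `update(item, delta)` interface, and the convention that an item counts as present while its net frequency is non-zero are all assumptions made for illustration.

```python
class ExactDistinctCounter:
    """Exact (linear-space) baseline for distinct counting over an update stream.

    Hypothetical illustration only; not the sub-linear algorithms of the paper.
    """

    def __init__(self):
        # Net frequency of every item whose frequency is currently non-zero.
        self.freq = {}

    def update(self, item, delta):
        """Apply one stream update: delta > 0 inserts, delta < 0 deletes."""
        new = self.freq.get(item, 0) + delta
        if new == 0:
            # The item has been fully deleted, so it no longer counts.
            self.freq.pop(item, None)
        else:
            self.freq[item] = new

    def count(self):
        """Number of items with non-zero net frequency."""
        return len(self.freq)


# Example update stream: (item, delta) pairs.
counter = ExactDistinctCounter()
for item, delta in [("a", +1), ("b", +1), ("a", +1), ("a", -2)]:
    counter.update(item, delta)
remaining = counter.count()  # only "b" still has non-zero frequency
```

The deletions are what make the problem harder than insert-only distinct counting: a summary must be able to "forget" an item once its net frequency returns to zero, which the dictionary does trivially but at linear space cost.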
