Range Efficient Computation of F0 over Massive Data Streams

Efficient one-pass computation of F0, the number of distinct elements in a data stream, is a fundamental problem arising in various contexts in databases and networking. We consider the problem of efficiently estimating F0 of a data stream where each element of the stream is an interval of integers. We present a randomized algorithm which gives an (ε, δ)-approximation of F0, with the following time complexity (n is the size of the universe of the items): (1) The amortized processing time per interval is O(log(1/δ) log(n/ε)). (2) The time to answer a query for F0 is O(log(1/δ)). The workspace used is O(1
