Approximating Large Frequency Moments with Pick-and-Drop Sampling

Given a data stream \(D = \{p_1, p_2, \ldots, p_m\}\) of m numbers from \(\{1, \ldots, n\}\), the frequency of i is defined as \(f_i = |\{j : p_j = i\}|\). The k-th frequency moment of D is defined as \(F_k = \sum_{i=1}^n f_i^k\). We consider the problem of approximating frequency moments in insertion-only streams for \(k \ge 3\). For any constant c we show an \(O(n^{1-2/k} \log(n) \log^{(c)}(n))\) upper bound on the space complexity of the problem. Here \(\log^{(c)}(n)\) is the iterated logarithm function (the logarithm applied c times). Our main technical contribution is a non-uniform sampling method on matrices. We call our method pick-and-drop sampling; it samples a heavy element (i.e., an element i with frequency \(\Omega(F_k)\)) with probability \(\Omega(1/n^{1-2/k})\) and gives an approximation \(\tilde{f_i} \ge (1-\epsilon) f_i\). In addition, the estimates never exceed the true values, that is, \(\tilde{f_j} \le f_j\) for all j. For constant \(\epsilon\), we reduce the space complexity of finding a heavy element to \(O(n^{1-2/k} \log(n))\) bits. We apply our method of recursive sketches and resolve the problem with \(O(n^{1-2/k} \log(n) \log^{(c)}(n))\) bits. We reduce the ratio between the upper and lower bounds from \(O(\log^2(n))\) to \(O(\log(n) \log^{(c)}(n))\). Thus, we provide a (roughly) quadratic improvement of the result of Andoni, Krauthgamer and Onak (FOCS 2011).
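
To make the definitions of \(f_i\) and \(F_k\) concrete, the following is a minimal Python sketch that computes them exactly for a tiny stream. It is only an illustration of the quantities being approximated: it stores every distinct element, so it uses linear space and is not the pick-and-drop sampling method. The function names `frequency_moment` and `heaviest_share` are hypothetical, and reporting the share \(f_i^k / F_k\) of the most frequent element is an assumption made here for illustration.

```python
from collections import Counter


def frequency_moment(stream, k):
    """Exact k-th frequency moment F_k = sum_i f_i^k (linear space)."""
    freqs = Counter(stream)  # f_i for every distinct element i
    return sum(f ** k for f in freqs.values())


def heaviest_share(stream, k):
    """Return the most frequent element and its share f_i^k / F_k.

    Illustrative helper (an assumption of this sketch), not part of
    the sampling method described in the abstract.
    """
    freqs = Counter(stream)
    item, f = freqs.most_common(1)[0]
    return item, f ** k / frequency_moment(stream, k)


# Frequencies: f_1 = 4, f_2 = 2, f_3 = 1, so F_3 = 64 + 8 + 1 = 73.
stream = [1, 2, 1, 3, 1, 2, 1]
print(frequency_moment(stream, 3))  # 73
print(heaviest_share(stream, 3))    # element 1 with share 64/73
```

In this example element 1 contributes 64 of the 73 units of \(F_3\), which shows why large moments are dominated by the most frequent elements and why sampling a heavy element is the central subproblem.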
