Estimating small frequency moments of data stream: a characteristic function approach

A data stream is viewed as a sequence of $M$ updates of the form $(\text{index},i,v)$ to an $n$-dimensional integer frequency vector $f$, where each update changes $f_i$ to $f_i + v$ and $v$ is an integer assumed to lie in $\{-m, \ldots, m\}$. The $p$th frequency moment $F_p$ is defined as $\sum_{i=1}^n \abs{f_i}^p$. We consider the problem of estimating $F_p$ to within a multiplicative approximation factor of $1\pm \epsilon$, for $p \in [0,2]$. Several estimators have been proposed for this problem, including Indyk's median estimator \cite{indy:focs00}, Li's geometric means estimator \cite{pinglib:2006}, and an \Hss-based estimator \cite{gc:random07}. The first two estimators require space $\tilde{O}(\epsilon^{-2})$, where the $\tilde{O}$ notation hides polylogarithmic factors in $\epsilon^{-1}$, $m$, $n$ and $M$. Recently, Kane, Nelson and Woodruff \cite{knw:soda10} presented a novel, space-optimal estimator called the log-cosine estimator. In this paper, we present an elementary analysis of the log-cosine estimator in a stand-alone setting, which is simpler than the original analysis in \cite{knw:soda10}.
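
To make the definition of $F_p$ concrete, the following minimal sketch computes it exactly by storing the whole frequency vector. It is only a reference baseline using linear space, not any of the small-space estimators discussed above. The function name `frequency_moment`, the `(i, v)` tuple encoding of updates, and the sample stream are illustrative assumptions, as is the usual convention that $F_0$ counts the nonzero coordinates of $f$.

```python
from collections import defaultdict


def frequency_moment(updates, p):
    """Exact F_p = sum_i |f_i|^p for a stream of (i, v) updates.

    Hypothetical helper for illustration: it keeps the full frequency
    vector, so it uses linear space rather than a small sketch.
    """
    f = defaultdict(int)
    for i, v in updates:
        f[i] += v  # each update changes f_i to f_i + v
    nonzero = [x for x in f.values() if x != 0]
    if p == 0:
        # Assumed convention: F_0 is the number of nonzero coordinates.
        return len(nonzero)
    return sum(abs(x) ** p for x in nonzero)


# Illustrative stream: item 2 is inserted and then fully deleted.
stream = [(1, 3), (2, -2), (1, -1), (3, 5), (2, 2)]
f0 = frequency_moment(stream, 0)
f1 = frequency_moment(stream, 1)
f2 = frequency_moment(stream, 2)
```

On this sample stream the final vector has $f_1 = 2$, $f_2 = 0$ and $f_3 = 5$, so a $(1\pm\epsilon)$-approximation of $F_1$ would need to return a value in $[7(1-\epsilon), 7(1+\epsilon)]$.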

[1]  Mahesh Viswanathan, et al.  An Approximate L1-Difference Algorithm for Massive Data Streams, 2002, SIAM J. Comput.

[2]  Noga Alon, et al.  The space complexity of approximating the frequency moments, 1996, STOC '96.

[3]  Srikanta Tirthapura, et al.  Estimating simple functions on the union of data streams, 2001, SPAA '01.

[4]  David P. Woodruff, et al.  On the exact space complexity of sketching and streaming small norms, 2010, SODA '10.

[5]  Ping Li, et al.  Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections, 2008, SODA '08.

[6]  Subhash Khot, et al.  Near-optimal lower bounds on the multi-party communication complexity of set disjointness, 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings.

[7]  Luca Trevisan, et al.  Counting Distinct Elements in a Data Stream, 2002, RANDOM.

[8]  Ziv Bar-Yossef, et al.  An information statistics approach to data stream and communication complexity, 2002, The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings.

[9]  Piotr Indyk, et al.  Stable distributions, pseudorandom generators, embeddings, and data stream computation, 2006, JACM.

[10]  Michael E. Saks, et al.  Space lower bounds for distance approximation in the data stream model, 2002, STOC '02.

[11]  Piotr Indyk, et al.  Fast mining of massive tabular data via approximate distance computations, 2002, Proceedings 18th International Conference on Data Engineering.

[12]  David P. Woodruff.  Optimal space lower bounds for all frequency moments, 2004, SODA '04.

[13]  Sudipto Guha, et al.  Fast, small-space algorithms for approximate histogram maintenance, 2002, STOC '02.

[14]  Graham Cormode, et al.  On Estimating Frequency Moments of Data Streams, 2007, APPROX-RANDOM.

[15]  Philippe Flajolet, et al.  Probabilistic Counting Algorithms for Data Base Applications, 1985, J. Comput. Syst. Sci.

[16]  Noam Nisan, et al.  Pseudorandom generators for space-bounded computation, 1992, Comb.