Fast moment estimation in data streams in optimal space

We give a space-optimal streaming algorithm with update time O(log²(1/ε) log log(1/ε)) for approximating the pth frequency moment, 0 < p < 2, of a length-n vector updated in a data stream, up to a factor of 1 ± ε. This provides a nearly exponential improvement over the previous space-optimal algorithm of [Kane-Nelson-Woodruff, SODA 2010], which had update time Ω(1/ε²). When combined with the work of [Harvey-Nelson-Onak, FOCS 2008], we also obtain the first algorithm for entropy estimation in turnstile streams that simultaneously achieves near-optimal space and fast update time.
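To make the problem statement concrete, the following is a minimal sketch of the quantity being approximated and of why update time matters. It is not the algorithm of this paper: it uses the classical 1-stable (Cauchy) sketch for the special case p = 1, and the names `exact_moment`, `CauchySketch`, `rows`, and `seed` are illustrative assumptions. Each update touches every counter, so with roughly 1/ε² counters the per-update cost grows as 1/ε², which is the kind of cost a fast-update algorithm avoids.

```python
import math
import random
from statistics import median


def exact_moment(updates, p):
    """Exact F_p = sum_i |x_i|^p after applying turnstile updates (i, delta)."""
    x = {}
    for i, delta in updates:
        x[i] = x.get(i, 0) + delta
    return sum(abs(v) ** p for v in x.values() if v != 0)


class CauchySketch:
    """Textbook 1-stable sketch for F_1 (p = 1 only); an illustrative baseline."""

    def __init__(self, rows, seed=0):
        self.rows = rows          # assumption: chosen on the order of 1/eps^2
        self.seed = seed
        self.counters = [0.0] * rows
        # Assumption for simplicity: cache one column of Cauchy variables per
        # item. This is not space-efficient; it only keeps the demo short.
        self._columns = {}

    def _column(self, i):
        if i not in self._columns:
            rng = random.Random(f"{self.seed}:{i}")
            # tan(pi * (U - 1/2)) is a standard Cauchy variable for uniform U.
            self._columns[i] = [
                math.tan(math.pi * (rng.random() - 0.5)) for _ in range(self.rows)
            ]
        return self._columns[i]

    def update(self, i, delta):
        """Apply x_i += delta; costs one operation per counter (slow update)."""
        col = self._column(i)
        for j in range(self.rows):
            self.counters[j] += delta * col[j]

    def estimate(self):
        """Each counter is F_1 times a Cauchy variable; median(|Cauchy|) = 1."""
        return median(abs(c) for c in self.counters)


if __name__ == "__main__":
    # Turnstile stream: insertions followed by some deletions.
    stream = [(i, 3) for i in range(20)] + [(i, -1) for i in range(10)] + [(5, -2)]
    sketch = CauchySketch(rows=2000, seed=1)
    for item, delta in stream:
        sketch.update(item, delta)
    print("exact F_1:", exact_moment(stream, 1))
    print("sketch estimate:", sketch.estimate())
```

The estimate is random, so it should only be expected to land near the exact value; increasing `rows` tightens the concentration at the price of slower updates.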

[1]  David P. Woodruff Efficient and Private Distance Approximation in the Communication and Streaming Models , 2007 .

[2]  Ping Li,et al.  On Practical Algorithms for Entropy Estimation and the Improved Sample Complexity of Compressed Counting , 2010, ArXiv.

[3]  Anna Pagh,et al.  Uniform hashing in constant time and linear space , 2003, STOC '03.

[4]  Ravi Kumar,et al.  The One-Way Communication Complexity of Hamming Distance , 2008, Theory Comput..

[5]  David P. Woodruff,et al.  On the exact space complexity of sketching and streaming small norms , 2010, SODA '10.

[6]  Graham Cormode,et al.  An improved data stream summary: the count-min sketch and its applications , 2004, J. Algorithms.

[7]  Charu C. Aggarwal,et al.  On the Surprising Behavior of Distance Metrics in High Dimensional Spaces , 2001, ICDT.

[8]  J. Wissel,et al.  On the Best Constants in the Khintchine Inequality , 2007 .

[9]  S. Muthukrishnan,et al.  Estimating Entropy and Entropy Norm on Data Streams , 2006, Internet Math..

[10]  Noam Nisan,et al.  Pseudorandom generators for space-bounded computations , 1990, STOC '90.

[11]  S. Muthukrishnan,et al.  Data streams: algorithms and applications , 2005, SODA '03.

[12]  David P. Woodruff,et al.  Numerical linear algebra in the streaming model , 2009, STOC '09.

[13]  Graham Cormode,et al.  A near-optimal algorithm for estimating the entropy of a stream , 2010, TALG.

[14]  Moses Charikar,et al.  Finding frequent items in data streams , 2002, Theor. Comput. Sci..

[15]  Donald F. Towsley,et al.  Detecting anomalies in network traffic using maximum entropy estimation , 2005, IMC '05.

[16]  Jessica H. Fong,et al.  An Approximate Lp Difference Algorithm for Massive Data Streams , 1999, Discret. Math. Theor. Comput. Sci..

[17]  Sumit Ganguly,et al.  Simpler algorithm for estimating frequency moments of data streams , 2006, SODA '06.

[18]  Sudipto Guha,et al.  Streaming and sublinear approximation of entropy and information distances , 2005, SODA '06.

[19]  Sumit Ganguly,et al.  Estimating Frequency Moments of Data Streams Using Random Linear Combinations , 2004, APPROX-RANDOM.

[20]  Graham Cormode,et al.  Summarizing and Mining Inverse Distributions on Data Streams via Dynamic Inverse Sampling , 2005, VLDB.

[21]  Daniel M. Kane,et al.  Bounded Independence Fools Degree-2 Threshold Functions , 2010, 2010 IEEE 51st Annual Symposium on Foundations of Computer Science.

[22]  David P. Woodruff,et al.  Lower bounds for sparse recovery , 2010, SODA '10.

[23]  J. P. Nolan Stable Distributions. Models for Heavy Tailed Data , 2001 .

[24]  David P. Woodruff,et al.  A Near-Optimal Algorithm for L1-Difference , 2009, ArXiv.

[25]  Divesh Srivastava,et al.  Information Theory For Data Management , 2009, Proc. VLDB Endow..

[26]  Ashwin Lall,et al.  A data streaming algorithm for estimating entropies of OD flows , 2007, IMC '07.

[27]  V. Zolotarev One-dimensional stable distributions , 1986 .

[28]  David P. Woodruff,et al.  Optimal approximations of the frequency moments of data streams , 2005, STOC '05.

[29]  Graham Cormode,et al.  A near-optimal algorithm for computing the entropy of a stream , 2007, SODA '07.

[30]  Anna Pagh,et al.  Uniform Hashing in Constant Time and Optimal Space , 2008, SIAM J. Comput..

[31]  Sumit Ganguly,et al.  Finding Frequent Items over General Update Streams , 2008, SSDBM.

[32]  Piotr Indyk,et al.  Fast mining of massive tabular data via approximate distance computations , 2002, Proceedings 18th International Conference on Data Engineering.

[33]  Ping Li,et al.  Estimators and tail bounds for dimension reduction in lα (0 < α ≤ 2) using stable random projections , 2008, SODA '08.

[34]  Subhash Khot,et al.  Near-optimal lower bounds on the multi-party communication complexity of set disjointness , 2003, 18th IEEE Annual Conference on Computational Complexity, 2003. Proceedings..

[35]  Piotr Indyk,et al.  Declaring independence via the sketching of sketches , 2008, SODA '08.

[36]  Sudipto Guha,et al.  Sketching information divergences , 2007, Machine Learning.

[37]  Michael L. Fredman,et al.  Storing a Sparse Table with O(1) Worst Case Access Time , 1984 .

[38]  T. S. Jayram,et al.  Open Problems in Data Streams and Related Topics: IITK Workshop on Algorithms for Data Streams '06 , 2007 .

[39]  Noga Alon,et al.  Tracking join and self-join sizes in limited storage , 1999, PODS '99.

[40]  André Gronemeier,et al.  Asymptotically Optimal Lower Bounds on the NIH-Multi-Party Information Complexity of the AND-Function and Disjointness , 2009, STACS.

[41]  Noga Alon,et al.  The space complexity of approximating the frequency moments , 1996, STOC '96.

[42]  David P. Woodruff Efficient and private distance approximation in the communication and streaming models , 2007 .

[43]  Tamás Sarlós,et al.  Improved Approximation Algorithms for Large Matrices via Random Projections , 2006, 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06).

[44]  Daniel M. Kane,et al.  A Derandomized Sparse Johnson-Lindenstrauss Transform , 2010, Electron. Colloquium Comput. Complex..

[45]  David P. Woodruff Optimal space lower bounds for all frequency moments , 2004, SODA '04.

[46]  Mikkel Thorup,et al.  Tabulation based 4-universal hashing with applications to second moment estimation , 2004, SODA '04.

[47]  C. Papadimitriou,et al.  The complexity of massive data set computations , 2002 .

[48]  Mahesh Viswanathan,et al.  An Approximate L1-Difference Algorithm for Massive Data Streams , 2002, SIAM J. Comput..

[49]  David P. Woodruff,et al.  1-pass relative-error Lp-sampling with applications , 2010, SODA '10.

[50]  T. S. Jayram Hellinger Strikes Back: A Note on the Multi-party Information Complexity of AND , 2009, APPROX-RANDOM.

[51]  Krzysztof Onak,et al.  Sketching and Streaming Entropy via Approximation Theory , 2008, 2008 49th Annual IEEE Symposium on Foundations of Computer Science.

[52]  Tyng-Luh Liu,et al.  Sparse Representations for Image Decompositions , 1999, International Journal of Computer Vision.

[53]  Noam Nisan,et al.  Pseudorandom generators for space-bounded computation , 1992, Comb..

[54]  David P. Woodruff,et al.  Fast Manhattan sketches in data streams , 2010, PODS '10.

[55]  Sumit Ganguly,et al.  Estimating Entropy over Data Streams , 2006, ESA.

[56]  Piotr Indyk,et al.  Stable distributions, pseudorandom generators, embeddings, and data stream computation , 2006, JACM.

[57]  Michael E. Saks,et al.  Space lower bounds for distance approximation in the data stream model , 2002, STOC '02.

[58]  David P. Woodruff,et al.  Optimal Bounds for Johnson-Lindenstrauss Transforms and Streaming Problems with Subconstant Error , 2011, TALG.

[59]  R. Gregory Taylor,et al.  Modern computer algebra , 2002, SIGA.

[60]  Balachander Krishnamurthy,et al.  Sketch-based change detection: methods, evaluation, and applications , 2003, IMC '03.

[61]  Ziv Bar-Yossef,et al.  An information statistics approach to data stream and communication complexity , 2002, The 43rd Annual IEEE Symposium on Foundations of Computer Science, 2002. Proceedings..