Universal Streaming

Given a stream of data, the typical approach in streaming algorithms is to design a sophisticated small-memory algorithm that computes one specific statistic over the stream. Once the stream is gone, it is usually impossible to compute a different statistic. In this paper, we consider the following possibility: can we collect a small amount of data during the stream that is “universal,” i.e., collected without knowing anything about the statistic we will later want to compute, other than the guarantee that, had we known the statistic ahead of time, it could have been computed with small memory? In other words, is it possible to collect data in small space during the stream such that any statistic computable with comparable space can be computed after the fact? We introduce this notion and answer the question affirmatively, with matching upper and lower bounds: we show that it is possible to collect universal statistics of polylogarithmic size, and we prove that these universal statistics allow us to compute, after the fact, all other statistics that are computable with similar amounts of memory. Our results hold both for the standard unbounded streaming model and for the sliding window streaming model.
