How to catch L2-heavy-hitters on sliding windows

Finding heavy elements (heavy hitters) in streaming data is one of the central and well-understood tasks in data stream processing. Yet in the sliding window model of streaming, where elements eventually expire, the problem of finding L2-heavy elements has remained completely open, despite multiple papers and considerable success in finding L1-heavy elements. Because the L2-heavy element problem does not satisfy certain conditions, existing methods for sliding window algorithms, such as smooth histograms or exponential histograms, are not directly applicable to it. In this paper, we develop the first polylogarithmic-memory algorithm for finding L2-heavy elements in the sliding window model. Our technique allows us to find not only L2-heavy elements, but also heavy elements with respect to any Lp with 0 < p ≤ 2 on sliding windows. By this we completely "close the gap" and resolve the question of finding Lp-heavy elements in the sliding window model with polylogarithmic memory, since it is well known that for p > 2 this task is impossible. We demonstrate the broader applicability of our method with two additional examples: we show how to obtain a sliding window approximation of the similarity of two streams, and of the fraction of elements that appear exactly a specified number of times within the window (the α-rarity problem). In these two illustrative examples, our method replaces the current expected memory bounds with worst-case bounds.
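
To make the notion of an Lp-heavy element on a sliding window concrete, the following is a minimal brute-force sketch. It is not the polylogarithmic-memory algorithm of the paper: it stores the whole window and counts exactly. It assumes one common definition, under which an element with frequency f in the window is φ-heavy if f ≥ φ · (Σ_j f_j^p)^(1/p); the function name `lp_heavy_elements` and the parameters `window`, `phi`, and `p` are hypothetical names chosen for illustration.

```python
from collections import Counter, deque


def lp_heavy_elements(stream, window, phi, p):
    """Exact (linear-memory) reference for Lp-heavy elements in the last
    `window` items of `stream`. Illustrative only; assumes the definition
    f_i >= phi * (sum_j f_j**p) ** (1/p)."""
    # Keep only the most recent `window` items; older items have expired.
    recent = deque(stream, maxlen=window)
    freq = Counter(recent)
    # Lp norm of the frequency vector restricted to the window.
    norm = sum(count ** p for count in freq.values()) ** (1.0 / p)
    return {item for item, count in freq.items() if count >= phi * norm}


# One element repeated 10 times, followed by 90 distinct elements.
stream = ["a"] * 10 + [f"x{i}" for i in range(90)]

l1_heavy = lp_heavy_elements(stream, window=100, phi=0.5, p=1)
l2_heavy = lp_heavy_elements(stream, window=100, phi=0.5, p=2)

# After 10 more distinct arrivals, every copy of "a" has left the window.
later = stream + [f"y{i}" for i in range(10)]
l2_heavy_later = lp_heavy_elements(later, window=100, phi=0.5, p=2)
```

In this example "a" has frequency 10 in a window of 100 items, so it falls well below half of the L1 norm (100) but above half of the L2 norm (√190 ≈ 13.8). This shows why L2-heaviness is the weaker requirement and can report elements that an L1 criterion with the same φ would miss. Once the copies of "a" expire, it is no longer reported.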
