Identifying the Most Recent Heavy Hitters in Large-Scale Streams Using Block-wise Counting

Identifying the most recent heavy hitters, i.e., the items that appear most frequently in a high-speed data stream, is a fundamental problem in real-time stream processing. Real-time streaming applications make this problem challenging in three respects: processing latency, space usage, and precision. Traditional schemes rely on sliding-window designs, which struggle to deliver high precision and low space usage at the same time. In this work, we propose a novel Block-wise Counting scheme that partitions the stream into tiny blocks to identify heavy hitters with high precision and low latency at a low space cost. Experimental results show that our scheme significantly improves identification precision by 65% and reduces processing latency by 87% compared with state-of-the-art designs.
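The abstract does not spell out the internals of the scheme, so the following is only a minimal sketch of the general idea of counting per block and keeping the most recent blocks. It is not the authors' implementation. The class name `BlockwiseCounter`, the parameters `block_size` and `num_blocks`, and the use of exact per-block counters (rather than a compact sketch structure) are all assumptions made for illustration.

```python
from collections import Counter, deque


class BlockwiseCounter:
    """Illustrative only: exact per-block counts over the most recent blocks."""

    def __init__(self, block_size, num_blocks):
        self.block_size = block_size            # items per block (assumed parameter)
        self.blocks = deque(maxlen=num_blocks)  # full deque drops its oldest block on append
        self.current = Counter()                # counts for the block being filled
        self.seen_in_current = 0

    def add(self, item):
        """Count one item and seal the current block once it is full."""
        self.current[item] += 1
        self.seen_in_current += 1
        if self.seen_in_current == self.block_size:
            self.blocks.append(self.current)
            self.current = Counter()
            self.seen_in_current = 0

    def top_k(self, k):
        """Return the k most frequent items across the retained blocks."""
        total = Counter(self.current)
        for block in self.blocks:
            total.update(block)
        return total.most_common(k)


# Hypothetical usage: the early "a" items age out as newer blocks arrive.
counter = BlockwiseCounter(block_size=4, num_blocks=2)
for item in ["a"] * 8 + ["b"] * 6 + ["c"] * 2:
    counter.add(item)
print(counter.top_k(2))
```

In this sketch, expiring old data costs one block eviction rather than per-item bookkeeping, and smaller blocks track the recency boundary more closely at the price of more counters to merge at query time.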
