Thresholded Monitoring in Distributed Data Streams

In this paper, we consider the problem of thresholded monitoring in distributed data streams: given multiple distributed data streams observed by multiple monitors during a certain period, find the items whose global frequencies over all data streams exceed a given threshold. We first derive a lower bound on the communication overhead of any deterministic algorithm for this problem. Then, we propose two schemes: Low-threshold Cascaded Cuckoo Filter (L-CCF) for low-threshold monitoring and High-threshold Cascaded Cuckoo Filter (H-CCF) for high-threshold monitoring. L-CCF and H-CCF identify items whose frequencies exceed the given threshold while achieving a desired false negative rate (FNR) and optimizing communication overhead. The key idea is to compress the communication overhead of transferring ID and frequency information at the same time. First, to reduce the communication overhead of transferring IDs, we encode each ID into separate tiny parts and store these parts in L-CCF or H-CCF. Second, to reduce the communication overhead of transferring frequencies, we adopt a carry-in counter technique in L-CCF and a multiple sampling technique in H-CCF. We evaluated L-CCF and H-CCF on two real-world traces and compared their performance with two adapted prior algorithms. Our experimental results show that, on average, L-CCF and H-CCF achieve FNRs that are 55.7% and 65.56% better, respectively, than those of the comparison algorithms, while the false positive rate (FPR) is maintained at about 2.23%.
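
The problem statement above can be illustrated with a naive exact baseline in which every monitor ships its full local counts to a coordinator. This is not L-CCF or H-CCF; it is only a minimal sketch of the result those schemes aim to approximate with far less communication. The function names and the sample streams are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Iterable, List, Set


def local_counts(stream: Iterable[Hashable]) -> Counter:
    """Count item occurrences observed by a single monitor."""
    return Counter(stream)


def items_over_threshold(streams: List[Iterable[Hashable]], threshold: int) -> Set[Hashable]:
    """Return the items whose global frequency over all streams exceeds `threshold`.

    Assumption: the coordinator receives every (ID, count) pair from every
    monitor, which is exactly the communication cost the paper tries to avoid.
    """
    total: Counter = Counter()
    for stream in streams:
        total.update(local_counts(stream))  # merge one monitor's counts
    return {item for item, freq in total.items() if freq > threshold}


if __name__ == "__main__":
    # Three hypothetical monitors, each observing a short stream.
    streams = [["a", "b", "a"], ["a", "c"], ["b", "c", "c"]]
    print(sorted(items_over_threshold(streams, threshold=2)))  # ['a', 'c']
```

In this toy input, "a" and "c" each appear three times globally and are reported, while "b" appears only twice and is not, even though no single monitor sees any item more than twice.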