Dynamic Count-Min Sketch for Analytical Queries Over Continuous Data Streams

Approximate query processing methods have been proposed for analytics over high-speed data streams; they compact a continuous stream into a space-constrained sketch and provide reliable estimates for a range of queries. Count-Min (CM) is the state-of-the-art sketching structure, supporting many queries with error-guaranteed estimates under limited space. However, CM requires its counter table to be created beforehand according to the size of the data stream, which is usually unpredictable for dynamic streams. In this paper, we propose an approach called Dynamic Count-Min sketch (DCM), which is suited to dynamic data sets and provides accurate estimates for point queries and self-join size queries. DCM is composed of incremental CM sketches and allocates space in a pay-as-you-go manner. Both our mathematical analysis and extensive experiments show that DCM is suited to data sets with dynamic or skewed inputs and provides error-guaranteed estimates using less space than CM.
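
To make the two query types concrete, the following is a minimal Python sketch of a standard Count-Min table with point and self-join size queries, followed by an illustrative wrapper that allocates further CM sketches as the stream grows. The abstract does not specify how DCM sizes or combines its sketches, so `DynamicCountMin`, its `capacity` parameter, the fixed per-sketch `epsilon` and `delta`, and the BLAKE2-based hashing are all assumptions made for illustration, not the authors' construction.

```python
import hashlib
import math


class CountMinSketch:
    """Standard Count-Min table: depth rows of width counters each."""

    def __init__(self, epsilon=0.01, delta=0.01):
        # Usual CM sizing: width = ceil(e / epsilon), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / epsilon)
        self.depth = math.ceil(math.log(1 / delta))
        self.table = [[0] * self.width for _ in range(self.depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row id
        # (an illustrative choice, not taken from the paper).
        digest = hashlib.blake2b(
            repr(item).encode(), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def point_query(self, item):
        # Each row overestimates under non-negative updates; take the minimum.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )

    def self_join_size(self):
        # Each row's sum of squared counters upper-bounds the sum of squared
        # item frequencies; take the minimum over rows.
        return min(sum(c * c for c in row) for row in self.table)


class DynamicCountMin:
    """Hypothetical pay-as-you-go wrapper: a new CM sketch is allocated
    whenever the current one has absorbed `capacity` updates (assumption)."""

    def __init__(self, capacity=1000, epsilon=0.01, delta=0.01):
        self.capacity = capacity
        self.epsilon = epsilon
        self.delta = delta
        self.sketches = [CountMinSketch(epsilon, delta)]
        self.filled = 0  # total count absorbed by the most recent sketch

    def update(self, item, count=1):
        if self.filled >= self.capacity:
            # Allocate space only when the stream outgrows the current sketch.
            self.sketches.append(CountMinSketch(self.epsilon, self.delta))
            self.filled = 0
        self.sketches[-1].update(item, count)
        self.filled += count

    def point_query(self, item):
        # An item may have been counted in any sketch, so sum the estimates.
        return sum(s.point_query(item) for s in self.sketches)
```

With non-negative updates, both `point_query` methods never underestimate the true frequency. The wrapper omits a self-join size query on purpose: summing per-sketch estimates would ignore cross terms between sketches, and the abstract does not say how DCM handles this.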
