On-Off Sketch: A Fast and Accurate Sketch on Persistence

Approximate stream processing has attracted much attention recently. Prior art mostly focuses on characteristics such as frequency, cardinality, and quantiles. Persistence, a newer characteristic, is receiving increasing attention. Unlike frequency, persistence captures behavior in which an item appears recurrently across many time windows of a data stream. There are two typical problems concerning persistence: persistence estimation and finding persistent items. In this paper, we propose the On-Off sketch to address both problems. For persistence estimation, we exploit the fact that the persistence of an item increases periodically (at most once per time window) and compress increments when multiple items are mapped to the same counter, which significantly reduces the error. Compared with the Count-Min sketch, 1) in theory, we prove that the error of the On-Off sketch is always smaller; 2) in experiments, the On-Off sketch achieves around 6.17 times smaller error and 2.2 times higher throughput. For finding persistent items, we propose a technique that separates persistent items from non-persistent ones, further improving the accuracy. We show that the space complexity of the On-Off sketch is much better than that of the state-of-the-art (PIE); in experiments, it reduces the error by up to 4 orders of magnitude and achieves 2.84 times higher throughput than prior algorithms.

PVLDB Reference Format: Yinda Zhang, Jinyang Li, Yutian Lei, Tong Yang, Zhetao Li, Gong Zhang, and Bin Cui. On-Off Sketch: A Fast and Accurate Sketch on Persistence. PVLDB, 14(2): 128-140, 2021. doi:10.14778/3425879.3425884

∗ Department of Computer Science and Technology, Peking University, China
† The Key Laboratory of Hunan Province for Internet of Things and Information Security and College of Computer Science, Xiangtan University, Xiangtan, China
‡ PCL Research Center of Networks and Communications, Pengcheng Laboratory
§ Huawei Theory Lab, China
¶ National Engineering Laboratory for Big Data Analysis Technology and Application (PKU), China
Corresponding author: Tong Yang (yangtongemail@gmail.com).

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 2 ISSN 2150-8097.
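
The persistence-estimation idea in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation: it assumes that each counter carries an On/Off flag, that an arrival increments a counter only while its flag is On and then switches it Off, and that all flags are switched back On at each window boundary. Under that assumption, any number of items colliding on one counter in the same window add at most 1 to it. The names `OnOffSketch`, `insert`, `end_window`, `estimate`, and `exact_persistence`, the default sizes, and the BLAKE2-based hashing are all choices made for this example.

```python
import hashlib
from collections import defaultdict


class OnOffSketch:
    """Toy persistence estimator (illustrative, not the paper's code).

    Assumption: `depth` rows of `width` counters, each with an On/Off flag.
    """

    def __init__(self, depth=3, width=64):
        self.depth = depth
        self.width = width
        self.counters = [[0] * width for _ in range(depth)]
        self.on = [[True] * width for _ in range(depth)]

    def _index(self, row, item):
        # Hypothetical hash choice: one BLAKE2 digest per (row, item) pair.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def insert(self, item):
        # Increment each mapped counter at most once per window.
        for row in range(self.depth):
            col = self._index(row, item)
            if self.on[row][col]:
                self.counters[row][col] += 1
                self.on[row][col] = False

    def end_window(self):
        # Window boundary: switch every flag back to On.
        for row in range(self.depth):
            self.on[row] = [True] * self.width

    def estimate(self, item):
        # Report the smallest mapped counter, as in Count-Min style queries.
        return min(
            self.counters[row][self._index(row, item)]
            for row in range(self.depth)
        )


def exact_persistence(windows):
    """Ground truth: number of windows in which each item appears."""
    counts = defaultdict(int)
    for window in windows:
        for item in set(window):
            counts[item] += 1
    return dict(counts)


# Three time windows; "a" is frequent in window 1 but that adds only 1.
windows = [["a", "a", "a", "b"], ["a", "c"], ["a", "b", "b"]]
sketch = OnOffSketch()
for window in windows:
    for item in window:
        sketch.insert(item)
    sketch.end_window()
exact = exact_persistence(windows)
```

In this toy version an estimate can never fall below the true persistence and can never exceed the number of windows seen, because every mapped counter grows by exactly 1 in a window where the item appears and by at most 1 in any window. A plain Count-Min counter, by contrast, would grow with every colliding arrival.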
