PR-Sketch: Monitoring Per-key Aggregation of Streaming Data with Nearly Full Accuracy

Computing per-key aggregation is indispensable in streaming data analysis and is formulated as two phases: an update phase and a recovery phase. As the size and speed of data streams rise, accurate per-key information becomes useful in many applications, such as anomaly detection, attack prevention, and online diagnosis. Although many algorithms have been proposed for per-key aggregation in stream processing, their accuracy guarantees cover only a small portion of keys. In this paper, we aim to achieve nearly full accuracy with limited resource usage. We follow the line of sketch-based techniques and observe that existing methods suffer from high errors for most keys. The reason is that they track keys with complicated mechanisms in the update phase and simply calculate each per-key aggregation from some specific counter in the recovery phase. We therefore present PR-Sketch, a novel sketching design that addresses these two limitations. PR-Sketch builds linear equations between counter values and per-key aggregations to improve accuracy, and records keys in the recovery phase to reduce resource usage in the update phase. We also provide an extension, called fast PR-Sketch, that further improves the processing rate. We derive the space complexity, time complexity, and guaranteed error probability of both PR-Sketch and fast PR-Sketch. We conduct trace-driven experiments with 100K keys and 1M items to compare our algorithms with multiple state-of-the-art methods. The results demonstrate the resource efficiency and nearly full accuracy of our algorithms.

PVLDB Reference Format: Siyuan Sheng, Qun Huang, Sa Wang, and Yungang Bao. PR-Sketch: Monitoring Per-key Aggregation of Streaming Data with Nearly Full Accuracy. PVLDB, 14(10): 1783-1796, 2021. doi:10.14778/3467861.3467868

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/N2-Sys/PR-Sketch/.

This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 14, No. 10, ISSN 2150-8097.
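To make the idea of building linear equations between counter values and per-key aggregations concrete, here is a minimal toy sketch in Python. It is an illustration under stated assumptions, not the PR-Sketch implementation: the class `ToySketch`, the digit-based `bucket` function (a stand-in for real hash functions, chosen so the example can be checked by hand), and the plain `set` used to remember keys are all hypothetical. The update phase only adds to counters; the recovery phase writes one equation per counter (counter value = sum of the aggregations of the keys mapped to it) and solves the system by least squares. A conventional "take the minimum counter" estimate is included for contrast.

```python
def bucket(key, row, width):
    # Toy stand-in for a hash function (assumption): the row-th
    # base-`width` digit of an integer key.
    return (key // width ** row) % width


def solve_least_squares(A, b):
    """Minimize ||Ax - b|| via the normal equations and Gauss-Jordan elimination."""
    n = len(A[0])
    # Augmented matrix [A^T A | A^T b].
    M = [[sum(row[i] * row[j] for row in A) for j in range(n)]
         + [sum(row[i] * bi for row, bi in zip(A, b))]
         for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("rank-deficient system; more counters are needed")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


class ToySketch:
    def __init__(self, rows=2, width=4):
        self.rows, self.width = rows, width
        self.counters = [[0] * width for _ in range(rows)]
        # Assumption for brevity: keys are remembered exactly in a set.
        # This is not how the paper records keys.
        self.keys = set()

    def update(self, key, value=1):
        """Update phase: add the item's value to one counter per row."""
        self.keys.add(key)
        for r in range(self.rows):
            self.counters[r][bucket(key, r, self.width)] += value

    def min_estimate(self, key):
        """Baseline: read the smallest counter the key maps to."""
        return min(self.counters[r][bucket(key, r, self.width)]
                   for r in range(self.rows))

    def recover(self):
        """Recovery phase: one linear equation per counter, solved jointly."""
        keys = sorted(self.keys)
        A, b = [], []
        for r in range(self.rows):
            for c in range(self.width):
                A.append([1.0 if bucket(k, r, self.width) == c else 0.0
                          for k in keys])
                b.append(float(self.counters[r][c]))
        return dict(zip(keys, solve_least_squares(A, b)))


sketch = ToySketch()
for key, value in [(0, 3), (1, 4), (5, 7), (10, 2), (6, 4), (1, 6)]:
    sketch.update(key, value)
recovered = sketch.recover()
```

In this toy stream the true aggregation of key 1 is 10. Both of its counters are shared with other keys, so `sketch.min_estimate(1)` returns 13, whereas `recovered[1]` is 10 up to floating-point rounding, because the other equations pin down the colliding keys. Exact recovery here depends on the system having full column rank; the solver raises `ValueError` otherwise.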
