Randomized Error Removal for Online Spread Estimation in Data Streaming

Measuring flow spread in real time from large, high-rate data streams has numerous practical applications. Here a data stream is modeled as a sequence of data items from different flows, and the spread of a flow is the number of distinct items in that flow. Past decades have witnessed tremendous performance improvements in single-flow spread estimation. However, when a data stream contains numerous flows, it remains a significant challenge to measure per-flow spread accurately while reducing the memory footprint. The goal of this paper is to introduce new multi-flow spread estimation designs that incur much smaller processing overhead and query overhead than the state of the art, yet achieve significantly better estimation accuracy. We formally analyze the performance of these new designs. We implement them in both hardware and software, and use real-world data traces to evaluate their performance against the state of the art. The experimental results show that our best sketch significantly improves over the best existing work in terms of estimation accuracy, data item processing throughput, and online query throughput.

PVLDB Reference Format: Haibo Wang, Chaoyi Ma, Olufemi O Odegbile, Shigang Chen, and Jih-Kwon Peir. Randomized Error Removal for Online Spread Estimation in Data Streaming. PVLDB, 14(6): 1040-1052, 2021. doi:10.14778/3447689.3447707
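
To make the definition of flow spread concrete, the following minimal sketch computes exact per-flow spread for a toy stream. It is an illustrative baseline only, not one of the paper's designs: the function name `exact_spreads`, the `(flow_id, item)` pair representation, and the sample data are all assumptions introduced here.

```python
from collections import defaultdict


def exact_spreads(stream):
    """Return {flow_id: number of distinct items} for (flow_id, item) pairs.

    Hypothetical exact baseline: it stores every distinct item per flow,
    so memory grows with the total number of distinct items.
    """
    seen = defaultdict(set)
    for flow_id, item in stream:
        seen[flow_id].add(item)  # duplicates within a flow are absorbed by the set
    return {flow_id: len(items) for flow_id, items in seen.items()}


# Toy stream (assumed data): flow "10.0.0.1" carries items a, b, a; flow "10.0.0.2" carries c.
stream = [
    ("10.0.0.1", "a"),
    ("10.0.0.1", "b"),
    ("10.0.0.1", "a"),
    ("10.0.0.2", "c"),
]
spreads = exact_spreads(stream)
print(spreads)
```

Because this baseline keeps every distinct item, its memory footprint scales with the stream's distinct-item count, which is the cost that compact spread estimators aim to avoid.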
