FlowFight: High performance-low memory top-k spreader detection

Abstract: A recurring task in security monitoring and anomaly detection is finding the top “spreaders” (or “scanners”), for instance hosts that connect to a large number of distinct destinations or probe many different ports. Estimating the top-k scanners and their cardinalities with as little memory as possible while running at multi-Gbps speed is a non-trivial task, because it requires “remembering” the destinations or ports that each host has already contacted. This paper proposes and assesses a design called FlowFight. As the name implies, the approach deploys a relatively small number of per-flow HyperLogLog approximate counters (only slightly more than the target k) and lets the potentially huge number of concurrent flows take part in a dynamic randomized “competition” to enter this set. The algorithm has been integrated into and tested in a full-fledged software router, Vector Packet Processor. Using both synthetic and real traffic traces, we show that FlowFight estimates the top-k cardinality flows with an accuracy of more than 95%, while sustaining a processing throughput of around 8 Mpps on a single core. We further show that FlowFight achieves the same accuracy as the state-of-the-art competitor SpreadSketch while using 10x less memory and delivering 1.2x higher throughput.
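The idea of a small pool of per-source HyperLogLog counters with randomized admission can be illustrated with a minimal Python sketch. It is not the FlowFight algorithm itself: the abstract does not specify the competition rule, so the one used here (a challenger evicts a randomly chosen incumbent with probability 1/(1 + the incumbent's estimate)) is an assumption made purely for illustration. The names `HyperLogLog`, `SpreaderTable`, `update`, and `top_k` are likewise hypothetical.

```python
import hashlib
import math
import random


class HyperLogLog:
    """Small textbook HyperLogLog with 2**p registers (p >= 7 assumed)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        index = h >> (64 - self.p)                # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining bits give the rank
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # constant valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:              # small-range (linear counting) correction
            return m * math.log(m / zeros)
        return raw


class SpreaderTable:
    """At most `slots` sources are tracked, each with its own HyperLogLog."""

    def __init__(self, slots, p=10, seed=0):
        self.slots = slots
        self.p = p
        self.rng = random.Random(seed)
        self.tracked = {}                         # source -> HyperLogLog

    def update(self, src, dst):
        hll = self.tracked.get(src)
        if hll is None:
            if len(self.tracked) >= self.slots:
                # ASSUMED competition rule, not taken from the paper:
                # challenge a random incumbent; it loses with probability
                # 1 / (1 + its current cardinality estimate).
                victim = self.rng.choice(list(self.tracked))
                lose_prob = 1.0 / (1.0 + self.tracked[victim].estimate())
                if self.rng.random() >= lose_prob:
                    return                        # challenger loses, packet ignored
                del self.tracked[victim]
            hll = self.tracked[src] = HyperLogLog(self.p)
        hll.add(dst)

    def top_k(self, k):
        ranked = sorted(((h.estimate(), s) for s, h in self.tracked.items()),
                        reverse=True)
        return [(s, est) for est, s in ranked[:k]]
```

Feeding each packet as `table.update(src_ip, dst_ip)` and then calling `table.top_k(k)` returns the tracked sources ranked by estimated number of distinct destinations. Memory is bounded by `slots` times the register array size, regardless of how many sources appear in the traffic.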
