A performance benchmark for NetFlow data analysis on distributed stream processing systems

Modern distributed stream processing systems can potentially be applied to real time network flow processing. However, differences in performance make some systems more suitable than others for being applied to this domain. We propose a novel performance benchmark, which is based on common security analysis algorithms of NetFlow data to determine the suitability of distributed stream processing systems. Three of the most used distributed stream processing systems are bench-marked and the results are compared with NetFlow data processing challenges and requirements. The benchmark results show that each system reached a sufficient data processing speed using a basic deployment scenario with little to no configuration tuning. Our benchmark, unlike any other, enables the performance of small structured messages to be processed on any stream processing system.

[1]  Michael E. Papka,et al.  The web page , 2000 .

[2]  Scott Shenker,et al.  Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters , 2012, HotCloud.

[3]  Aiko Pras,et al.  The Network Data Handling War: MySQL vs. NfDump , 2010, EUNICE.

[4]  Gang Wu,et al.  Stream Bench: Towards Benchmarking Modern Distributed Stream Computing Frameworks , 2014, 2014 IEEE/ACM 7th International Conference on Utility and Cloud Computing.

[5]  Leonardo Neumeyer,et al.  S4: Distributed Stream Computing Platform , 2010, 2010 IEEE International Conference on Data Mining Workshops.

[6]  Tom White,et al.  Hadoop: The Definitive Guide , 2009 .

[7]  Aiko Pras,et al.  An Overview of IP Flow-Based Intrusion Detection , 2010, IEEE Communications Surveys & Tutorials.

[8]  Jay Kreps,et al.  Kafka : a Distributed Messaging System for Log Processing , 2011 .

[9]  Radu State,et al.  A Big Data Architecture for Large Scale Security Monitoring , 2014, 2014 IEEE International Congress on Big Data.

[10]  Kensuke Fukuda,et al.  Hashdoop: A MapReduce framework for network anomaly detection , 2014, 2014 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS).

[11]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[12]  Youngseok Lee,et al.  Toward scalable internet traffic measurement and analysis with Hadoop , 2013, CCRV.

[13]  Scott Shenker,et al.  Spark: Cluster Computing with Working Sets , 2010, HotCloud.

[14]  Aiko Pras,et al.  Flow Monitoring Explained: From Packet Capture to Data Analysis With NetFlow and IPFIX , 2014, IEEE Communications Surveys & Tutorials.