Big Data Network Flow Processing Using Apache Spark

The growing volume of traffic flows captured during network monitoring makes their analysis increasingly difficult. One goal of network traffic analysis is to identify malicious communication. In this paper, we present a new system for classifying and clustering network flows at big data scale. The proposed system is built on popular big data engines such as Apache Spark and Apache Ignite. The conducted experiments demonstrate the feasibility of the proposed approach and indicate its potential to scale.
