Traffic Generation using Containerization for Machine Learning

The design and evaluation of data-driven network intrusion detection methods are currently held back by a lack of adequate data, for both benign and attack traffic. Existing datasets are mostly gathered in isolated lab environments containing virtual machines, both to offer more control over the computer interactions and to prevent any malicious code from escaping. This procedure, however, leads to datasets that lack four core properties: heterogeneity, ground-truth traffic labels, large data size, and contemporary content. Here, we present a novel data generation framework based on Docker containers that addresses these problems systematically. To this end, we arrange suitable containers into relevant traffic communication scenarios and subscenarios, which are subject to appropriate input randomization as well as WAN emulation. By relying on process isolation through containerization, we can match traffic events with individual processes, and achieve scalability and modularity of individual traffic scenarios. We perform two experiments to assess the reproducibility and traffic properties of our framework, and demonstrate its usefulness on a traffic classification example.
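As an illustration of how input randomization and WAN emulation could be combined for a single container, the following minimal sketch draws random delay, jitter, and loss values and builds a NetEm `tc` command to be run inside a container through `docker exec`. It is not the framework's implementation: the function names, the container name `client`, the interface `eth0`, and the parameter ranges are all assumptions chosen for illustration. The sketch only builds and prints the command; it does not execute it.

```python
import random
import shlex


def netem_command(interface, rng, max_delay_ms=200, max_jitter_ms=20, max_loss_pct=2.0):
    """Build a `tc` NetEm command with randomly drawn WAN parameters.

    The parameter ranges are illustrative assumptions, not values from the paper.
    """
    delay = rng.randint(1, max_delay_ms)                 # mean one-way delay in ms
    jitter = rng.randint(0, min(max_jitter_ms, delay))   # delay variation in ms
    loss = round(rng.uniform(0.0, max_loss_pct), 2)      # packet loss in percent
    return [
        "tc", "qdisc", "add", "dev", interface, "root", "netem",
        "delay", f"{delay}ms", f"{jitter}ms",
        "loss", f"{loss}%",
    ]


def docker_exec_command(container, interface, rng):
    """Wrap the NetEm command so that it runs inside the given container."""
    return ["docker", "exec", container] + netem_command(interface, rng)


if __name__ == "__main__":
    # A seeded generator makes the drawn parameters repeatable across runs.
    rng = random.Random(42)
    # "client" and "eth0" are hypothetical names used only for this example.
    print(shlex.join(docker_exec_command("client", "eth0", rng)))
```

Seeding the generator keeps a scenario repeatable while still varying conditions between seeds. Actually applying the command would require the container to have the `NET_ADMIN` capability and the `tc` utility installed.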
