Flow-based benchmark data sets for intrusion detection

Anomaly-based intrusion detection systems suffer from a lack of appropriate evaluation data sets. Existing data sets often cannot be published due to privacy concerns or do not reflect current attack scenarios. To overcome these problems, we identify characteristics of good data sets and develop a concept for generating labelled flow-based data sets that satisfy these criteria. The concept is implemented on the basis of OpenStack, demonstrating the suitability of virtual environments. Compared to static data sets, virtual environments make it easy to create up-to-date data sets that reflect recent trends in user behaviour as well as new attack scenarios. In particular, we emulate a small business environment which includes several clients and typical servers. Network traffic is generated by scripts which emulate typical user activities on the clients, such as surfing the web, writing emails, or printing documents. These scripts follow guidelines to make the emulated user behaviour as realistic as possible, including working hours and lunch breaks. The generated network traffic is recorded in unidirectional NetFlow format. To generate malicious traffic, attacks such as Denial of Service, Brute Force, and Port Scans are executed within the network. Since the origins, targets, and timestamps of the executed attacks are known, the recorded NetFlow data can be labelled easily. To include real traffic originating outside the OpenStack environment, an external server running two services is deployed. This server has a public IP address and is exposed to real and up-to-date attacks from the internet. We captured approximately 32 million flows over a period of four weeks and categorized them into five classes. Further, the chronological sequence of the flows is analysed and the distribution of normal and malicious traffic is discussed in detail. The main contribution of this paper is a novel approach that uses OpenStack as a basis for generating realistic data sets for the evaluation of network intrusion detection systems.
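
The labelling step lends itself to a short illustration. The following Python sketch is not part of the paper; the Attack structure, the flow field names, and the example values are assumptions. It only shows the matching logic implied by the abstract: a unidirectional flow record is given an attack label when its endpoints and timestamp fall within the known attacker IP, victim IP, and time window of an executed attack, and is otherwise treated as normal traffic.

    # Minimal sketch (assumed field names, not the authors' tooling) of
    # labelling recorded flows from known attack origins, targets and times.
    from dataclasses import dataclass
    from datetime import datetime


    @dataclass
    class Attack:
        attack_type: str   # e.g. "dos", "bruteForce", "portScan" (illustrative names)
        attacker_ip: str
        victim_ip: str
        start: datetime
        end: datetime


    def label_flow(flow: dict, attacks: list[Attack]) -> str:
        """Assign a label to a unidirectional flow record.

        `flow` is assumed to carry 'src_ip', 'dst_ip' and 'first_seen'
        (a datetime); real NetFlow exports name these fields differently
        depending on the collector.
        """
        ts = flow["first_seen"]
        for atk in attacks:
            endpoints = {flow["src_ip"], flow["dst_ip"]}
            # Flow touches the attacker or victim within the attack window.
            if endpoints & {atk.attacker_ip, atk.victim_ip} and atk.start <= ts <= atk.end:
                return atk.attack_type
        return "normal"


    if __name__ == "__main__":
        attacks = [
            Attack("portScan", "192.168.100.5", "192.168.200.8",
                   datetime(2017, 3, 20, 10, 0), datetime(2017, 3, 20, 10, 5)),
        ]
        flow = {"src_ip": "192.168.100.5", "dst_ip": "192.168.200.8",
                "first_seen": datetime(2017, 3, 20, 10, 2)}
        print(label_flow(flow, attacks))  # -> portScan

The paper itself distinguishes five traffic classes; the binary attack-or-normal fallback above is only meant to convey how known attack metadata makes the labelling straightforward.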
