Balancing Approaches towards ML for IDS: A Survey for the CSE-CIC IDS Dataset

The balance of a dataset plays a key role in the bias that machine learning algorithms exhibit in classification and prediction. The CSE-CIC IDS datasets published in 2017 and 2018 have both attracted considerable scholarly attention in intrusion detection systems research. Recent work using these datasets pays little attention to their imbalance. The study presented in this paper explores the degree to which imbalance has been treated and provides a taxonomy of the machine learning approaches developed using these datasets. We surveyed published works that use these datasets, combining qualitative and quantitative analysis to derive the taxonomy. The research presented here confirms that the impact of bias due to imbalanced datasets is rarely addressed. These findings support further research and development of supervised machine learning techniques that reduce the impact of bias in classification or prediction due to these imbalanced datasets.
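
To illustrate why untreated imbalance can bias evaluation, the following minimal sketch uses made-up labels (990 benign flows and 10 attack flows, not figures from the CSE-CIC IDS datasets) and a trivial classifier that always predicts the majority class. The helper names `imbalance_ratio`, `accuracy`, and `recall` are hypothetical and chosen only for this example.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive):
    """Fraction of true `positive` samples that were predicted as `positive`."""
    true_positives = sum(
        t == positive and p == positive for t, p in zip(y_true, y_pred)
    )
    actual_positives = sum(t == positive for t in y_true)
    return true_positives / actual_positives if actual_positives else 0.0


# Synthetic, illustrative labels only (an assumption, not real dataset counts).
y_true = ["benign"] * 990 + ["attack"] * 10

# A trivial baseline that always predicts the most frequent class.
majority_label = Counter(y_true).most_common(1)[0][0]
y_pred = [majority_label] * len(y_true)

ratio = imbalance_ratio(y_true)
acc = accuracy(y_true, y_pred)
attack_recall = recall(y_true, y_pred, "attack")

print(f"imbalance ratio: {ratio:.0f}:1")
print(f"accuracy:        {acc:.2f}")
print(f"attack recall:   {attack_recall:.2f}")
```

Under these assumptions the baseline scores high accuracy while detecting no attacks at all, which is the kind of bias that goes unnoticed when only accuracy is reported on an imbalanced dataset.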
