Intrusion Detection Using Big Data and Deep Learning Techniques

In this paper, Big Data and Deep Learning techniques are integrated to improve the performance of intrusion detection systems. Three classifiers are used to classify network traffic datasets: a Deep Feed-Forward Neural Network (DNN) and two ensemble techniques, Random Forest and Gradient Boosting Tree (GBT). To select the most relevant attributes from the datasets, we use a homogeneity metric to evaluate features. Two recently published datasets, UNSW-NB15 and CICIDS2017, are used to evaluate the proposed method, and the machine learning models are assessed with 5-fold cross-validation. We implemented the method on the distributed computing environment Apache Spark: the deep learning technique is implemented with the Keras Deep Learning Library integrated with Spark, while the ensemble techniques are implemented with the Apache Spark Machine Learning Library. On the UNSW-NB15 dataset, DNN achieves high accuracy: 99.16% for binary classification and 97.01% for multiclass classification. On the CICIDS2017 dataset, the GBT classifier achieves the best binary classification accuracy at 99.99%, while DNN has the highest multiclass accuracy at 99.56%.
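The abstract does not spell out how the homogeneity metric is computed or how the folds are built, so the following is only a minimal sketch under stated assumptions. It assumes the conditional-entropy definition of homogeneity, h = 1 - H(C|K)/H(C), where C is the class label and K is a grouping of the samples. As a stand-in for clustering, each feature is grouped by equal-width binning, which is an illustrative choice and not necessarily the authors' procedure. The function names, the toy rows, and the feature names `duration` and `noise` are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy H(C) of a list of class labels (natural log)."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def conditional_entropy(classes, clusters):
    """Conditional entropy H(C | K) of classes given cluster assignments."""
    n = len(classes)
    joint = Counter(zip(clusters, classes))
    cluster_sizes = Counter(clusters)
    return -sum(
        (count / n) * math.log(count / cluster_sizes[k])
        for (k, _), count in joint.items()
    )


def homogeneity(classes, clusters):
    """h = 1 - H(C|K) / H(C); defined as 1.0 when there is a single class."""
    h_c = entropy(classes)
    if h_c == 0.0:
        return 1.0
    return 1.0 - conditional_entropy(classes, clusters) / h_c


def equal_width_bins(values, n_bins=4):
    """Assumption: group one feature's values into equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def rank_features(rows, labels, names, n_bins=4):
    """Score every feature by homogeneity and sort from best to worst."""
    scores = {}
    for j, name in enumerate(names):
        column = [row[j] for row in rows]
        scores[name] = homogeneity(labels, equal_width_bins(column, n_bins))
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def k_fold_indices(n, k=5):
    """Yield (train, test) index lists for k-fold cross-validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


# Hypothetical toy traffic records: (duration, noise)
rows = [(0.1, 5), (0.2, 1), (0.3, 5), (0.4, 1),
        (9.1, 5), (9.2, 1), (9.3, 5), (9.4, 1)]
labels = ["normal"] * 4 + ["attack"] * 4
names = ["duration", "noise"]

ranked = rank_features(rows, labels, names)
splits = list(k_fold_indices(len(rows), k=5))
print(ranked)
print([len(test) for _, test in splits])
```

In this toy data, `duration` separates the two classes while `noise` carries no class information, so a homogeneity-based ranking would keep the former and discard the latter. In a Spark setting the same scores could be computed per column in a distributed fashion; that step is not shown here.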
