Variational data generative model for intrusion detection

A Network Intrusion Detection System (NIDS) detects intrusive or malicious activity and policy violations on a host or network. Access to balanced and diverse training data is essential for any detection system, yet intrusion data rarely have these properties: samples of network traffic are strongly biased toward normal traffic, and traffic associated with intrusion events is difficult to obtain. A method is therefore needed to synthesize intrusion data whose probabilistic and behavioral structure matches that of the original data, and in this work we provide such a method. Intrusion data contain continuous and categorical features, with a strongly imbalanced distribution of intrusion labels. For this reason we generate synthetic samples conditioned on the distribution of labels: given a particular set of labels, we generate training samples associated with those labels, replicating the probabilistic structure of the original data corresponding to them. Our generative model is a customized variational autoencoder that takes the intrusion-class labels as an additional input to the network. This modification provides an advantage: we can generate new data directly from the labels, without relying on training samples as canonical representatives of each label, which makes the generation process more reliable, less complex, and faster. We show that the synthetic data are similar to the real data, and that the synthesized data can be used to improve the performance scores of common machine-learning classifiers.
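The label-conditioned architecture described above can be illustrated with a minimal sketch of a conditional variational autoencoder in PyTorch. This is not the authors' implementation; all layer sizes, the hidden width, and the latent dimension are illustrative assumptions. The key point it shows is that the labels `y` are concatenated to the input of both the encoder and the decoder, so that after training, synthetic rows can be sampled from a label vector alone.

```python
import torch
import torch.nn as nn


class ConditionalVAE(nn.Module):
    """Sketch of a label-conditioned VAE: the decoder receives the
    class labels, so generation needs only a label, not a seed sample."""

    def __init__(self, n_features, n_labels, latent_dim=8, hidden=64):
        super().__init__()
        # Encoder sees the features concatenated with one-hot labels.
        self.enc = nn.Sequential(
            nn.Linear(n_features + n_labels, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, latent_dim)
        self.logvar = nn.Linear(hidden, latent_dim)
        # Decoder sees the latent code concatenated with the same labels.
        self.dec = nn.Sequential(
            nn.Linear(latent_dim + n_labels, hidden), nn.ReLU(),
            nn.Linear(hidden, n_features))

    def forward(self, x, y):
        h = self.enc(torch.cat([x, y], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        # Reparameterization trick: z = mu + sigma * eps.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        return self.dec(torch.cat([z, y], dim=1)), mu, logvar

    def sample(self, y):
        # Generate synthetic feature rows from labels alone.
        z = torch.randn(y.shape[0], self.mu.out_features)
        return self.dec(torch.cat([z, y], dim=1))
```

In practice the reconstruction loss would be split between continuous features (e.g. squared error) and categorical features (e.g. cross-entropy over one-hot groups), plus the usual KL-divergence term; those details are omitted here for brevity.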
