A Learning-based Data Augmentation for Network Anomaly Detection

While machine learning technologies have been remarkably advanced over the past several years, one of the fundamental requirements for the success of learning-based approaches would be the availability of high-quality data that thoroughly represent individual classes in a problem space. Unfortunately, it is not uncommon to observe a significant degree of class imbalance with only a few instances for minority classes in many datasets, including network traffic traces highly skewed toward a large number of normal connections while very small in quantity for attack instances. A well-known approach to addressing the class imbalance problem is data augmentation that generates synthetic instances belonging to minority classes. However, traditional statistical techniques may be limited since the extended data through statistical sampling should have the same density as original data instances with a minor degree of variation. This paper takes a learning-based approach to data augmentation to enable effective network anomaly detection. One of the critical challenges for the learning-based approach is the mode collapse problem resulting in a limited diversity of samples, which was also observed from our preliminary experimental result. To this end, we present a novel "Divide-Augment-Combine" (DAC) strategy, which groups the instances based on their characteristics and augments data on a group basis to represent a subset independently using a generative adversarial model. Our experimental results conducted with two recently collected public network datasets (UNSW-NB15 and IDS-2017) show that the proposed technique enhances performances up to 21.5% for identifying network anomalies.

[1]  Silvana Trimi,et al.  Big-data applications in the government sector , 2014, Commun. ACM.

[2]  Mahesh Shirole,et al.  Benchmarking datasets for Anomaly-based Network Intrusion Detection: KDD CUP 99 alternatives , 2018, 2018 IEEE 3rd International Conference on Computing, Communication and Security (ICCCS).

[3]  Jameela Al-Jaroodi,et al.  Applications of big data to smart cities , 2015, Journal of Internet Services and Applications.

[4]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[5]  Jugal K. Kalita,et al.  Network Anomaly Detection: Methods, Systems and Tools , 2014, IEEE Communications Surveys & Tutorials.

[6]  Georg Langs,et al.  Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery , 2017, IPMI.

[7]  R. L. Thorndike Who belongs in the family? , 1953 .

[8]  Chee Peng Lim,et al.  Credit Card Fraud Detection Using AdaBoost and Majority Voting , 2019, IEEE Access.

[9]  Yuval Elovici,et al.  DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN , 2018, 2018 IEEE International Conference on Data Mining (ICDM).

[10]  Geoffrey E. Hinton,et al.  Visualizing Data using t-SNE , 2008 .

[11]  Jill Slay,et al.  The evaluation of Network Anomaly Detection Systems: Statistical analysis of the UNSW-NB15 data set and the comparison with the KDD99 data set , 2016, Inf. Secur. J. A Glob. Perspect..

[12]  See-Kiong Ng,et al.  Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series , 2018, ArXiv.

[13]  Jinoh Kim,et al.  A survey of deep learning-based network anomaly detection , 2017, Cluster Computing.

[14]  Stephen Kwek,et al.  Applying Support Vector Machines to Imbalanced Datasets , 2004, ECML.

[15]  Ali A. Ghorbani,et al.  Toward Generating a New Intrusion Detection Dataset and Intrusion Traffic Characterization , 2018, ICISSP.

[16]  Mohammad Khalilia,et al.  Predicting disease risks from highly imbalanced data using random forest , 2011, BMC Medical Informatics Decis. Mak..

[17]  Peter Steenkiste,et al.  Network Anomaly Detection Using Co-clustering , 2012, 2012 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.

[18]  KimGang-Hoon,et al.  Big-data applications in the government sector , 2014 .

[19]  Nour Moustafa,et al.  UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set) , 2015, 2015 Military Communications and Information Systems Conference (MilCIS).

[20]  Nitesh V. Chawla,et al.  SMOTE: Synthetic Minority Over-sampling Technique , 2002, J. Artif. Intell. Res..

[21]  Terry Anthony Byrd,et al.  Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations , 2018 .

[22]  Jinyan Li,et al.  Adaptive Swarm Balancing Algorithms for rare-event prediction in imbalanced healthcare data , 2017, PloS one.

[23]  Alfredo De Santis,et al.  Using generative adversarial networks for improving classification effectiveness in credit card fraud detection , 2017, Inf. Sci..

[24]  Frans Stokman,et al.  Encyclopedia of Social Network Analysis and Mining , 2014 .

[25]  Seetha Hari,et al.  Learning From Imbalanced Data , 2019, Advances in Computer and Electrical Engineering.

[26]  Chuan Sheng Foo,et al.  Efficient GAN-Based Anomaly Detection , 2018, ArXiv.