DAPT 2020 - Constructing a Benchmark Dataset for Advanced Persistent Threats

Machine learning is being embraced by information security researchers and organizations alike for its potential in detecting attacks that an organization faces, specifically attacks that go undetected by traditional signature-based intrusion detection systems. Along with the ability to process large amounts of data, machine learning brings the potential to detect contextual and collective anomalies, an essential attribute of an ideal threat detection system. Datasets play a vital role in developing machine learning models that are capable of detecting complex and sophisticated threats like Advanced Persistent Threats (APT). However, there is currently no APT-dataset that can be used for modeling and detecting APT attacks. Characterized by the sophistication involved and the determined nature of the APT attackers, these threats are not only difficult to detect but also to model. Generic intrusion datasets have three key limitations (1) They capture attack traffic at the external endpoints, limiting their usefulness in the context of APTs which comprise of attack vectors within the internal network as well (2) The difference between normal and anomalous behavior is quiet distinguishable in these datasets and thus fails to represent the sophisticated attackers’ of APT attacks (3) The data imbalance in existing datasets do not reflect the real-world settings rendering themselves as a benchmark for supervised models and falling short of semi-supervised learning. To address these concerns, in this paper, we propose a dataset DAPT 2020 which consists of attacks that are part of Advanced Persistent Threats (APT). These attacks (1) are hard to distinguish from normal traffic flows but investigate the raw feature space and (2) comprise of traffic on both public-to-private interface and the internal (private) network. Due to the existence of severe class imbalance, we benchmark DAPT 2020 dataset on semi-supervised models and show that they perform poorly trying to detect attack traffic in the various stages of an APT.

[1]  Martin Roesch,et al.  Snort - Lightweight Intrusion Detection for Networks , 1999 .

[2]  Robert K. Cunningham,et al.  Evaluating Intrusion Detection Systems Without Attacking Your Friends: The 1998 DARPA Intrusion Detection Evaluation , 1999 .

[3]  Mark Goadrich,et al.  The relationship between Precision-Recall and ROC curves , 2006, ICML.

[4]  VARUN CHANDOLA,et al.  Anomaly detection: A survey , 2009, CSUR.

[5]  Kensuke Fukuda,et al.  MAWILab: combining diverse anomaly detectors for automated anomaly labeling and performance benchmarking , 2010, CoNEXT.

[6]  Pascal Vincent,et al.  Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion , 2010, J. Mach. Learn. Res..

[7]  Thomas M. Chen,et al.  Lessons from Stuxnet , 2011, Computer.

[8]  Levente Buttyán,et al.  The Cousins of Stuxnet: Duqu, Flame, and Gauss , 2012, Future Internet.

[9]  Ali A. Ghorbani,et al.  Toward developing a systematic approach to generate benchmark datasets for intrusion detection , 2012, Comput. Secur..

[10]  Levente Buttyán,et al.  Duqu: Analysis, Detection, and Lessons Learned , 2012 .

[11]  Dijiang Huang,et al.  NICE: Network Intrusion Detection and Countermeasure Selection in Virtual Network Systems , 2013, IEEE Transactions on Dependable and Secure Computing.

[12]  Thomas G. Dietterich,et al.  Systematic construction of anomaly detection benchmarks from real data , 2013, ODD '13.

[13]  Dirk Merkel,et al.  Docker: lightweight Linux containers for consistent development and deployment , 2014 .

[14]  Richard Kissel,et al.  Glossary of Key Information Security Terms , 2014 .

[15]  Sungzoon Cho,et al.  Variational Autoencoder based Anomaly Detection using Reconstruction Probability , 2015 .

[16]  Nour Moustafa,et al.  UNSW-NB15: a comprehensive data set for network intrusion detection systems (UNSW-NB15 network data set) , 2015, 2015 Military Communications and Information Systems Conference (MilCIS).

[17]  B. Wu,et al.  Detecting APT Malware Infections Based on Malicious DNS and Traffic Analysis , 2015, IEEE Access.

[18]  Jonghyun Kim,et al.  Behavior-based anomaly detection on big data , 2015 .

[19]  S. P. Shantharajah,et al.  A Study on NSL-KDD Dataset for Intrusion Detection System Based on Classification Algorithms , 2015 .

[20]  Yao Wang,et al.  A deep learning approach for detecting malicious JavaScript code , 2016, Secur. Commun. Networks.

[21]  Michele Colajanni,et al.  Analysis of high volumes of network traffic for Advanced Persistent Threat detection , 2016, Comput. Networks.

[22]  Jong Hyuk Park,et al.  A comprehensive study on APT attacks and countermeasures for future networks and communications: challenges and solutions , 2019, The Journal of Supercomputing.

[23]  William H. Sanders,et al.  Intrusion detection in enterprise systems by combining and clustering diverse monitor data , 2016, HotSoS.

[24]  Jarke J. van Wijk,et al.  Understanding the context of network traffic alerts , 2016, 2016 IEEE Symposium on Visualization for Cyber Security (VizSec).

[25]  Witold Kinsner,et al.  Detecting Advanced Persistent Threats using Fractal Dimension based Machine Learning Classification , 2016, IWSPA@CODASPY.

[26]  Michele Colajanni,et al.  Countering Advanced Persistent Threats through security intelligence and big data analytics , 2016, 2016 8th International Conference on Cyber Conflict (CyCon).

[27]  Xiaoyong Yuan PhD Forum: Deep Learning-Based Real-Time Malware Detection with Multi-Stage Analysis , 2017, 2017 IEEE International Conference on Smart Computing (SMARTCOMP).

[28]  Yulei Rao,et al.  A deep learning framework for financial time series using stacked autoencoders and long-short term memory , 2017, PloS one.

[29]  Ram Shankar Siva Kumar,et al.  Practical Machine Learning for Cloud Intrusion Detection: Challenges and the Way Forward , 2017, AISec@CCS.

[30]  Amos J. Storkey,et al.  Data Augmentation Generative Adversarial Networks , 2017, ICLR 2018.

[31]  Feifei Li,et al.  DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning , 2017, CCS.

[32]  Amos J. Storkey,et al.  Augmenting Image Classifiers Using Data Augmentation Generative Adversarial Networks , 2018, ICANN.

[33]  Xiaohui Song,et al.  A Unsupervised Learning Method of Anomaly Detection Using GRU , 2018, 2018 IEEE International Conference on Big Data and Smart Computing (BigComp).

[34]  Khaled M. Rabie,et al.  Detection of advanced persistent threat using machine-learning correlation analysis , 2018, Future Gener. Comput. Syst..

[35]  Sailik Sengupta,et al.  Markov Game Modeling of Moving Target Defense for Strategic Detection of Threats in Cloud Networks , 2018, ArXiv.

[36]  Ali A. Ghorbani,et al.  A Detailed Analysis of the CICIDS2017 Data Set , 2018, ICISSP.

[37]  Dijiang Huang,et al.  A Survey on Advanced Persistent Threats: Techniques, Solutions, Challenges, and Research Opportunities , 2019, IEEE Communications Surveys & Tutorials.

[38]  V. N. Venkatakrishnan,et al.  HOLMES: Real-Time APT Detection through Correlation of Suspicious Information Flows , 2018, 2019 IEEE Symposium on Security and Privacy (SP).

[39]  Luca Benini,et al.  A semisupervised autoencoder-based approach for anomaly detection in high performance computing systems , 2019, Eng. Appl. Artif. Intell..

[40]  Chunhua Shen,et al.  Weakly-supervised Deep Anomaly Detection with Pairwise Relation Learning , 2019, ArXiv.

[41]  Sailik Sengupta,et al.  General Sum Markov Games for Strategic Detection of Advanced Persistent Threats Using Moving Target Defense in Cloud Networks , 2019, GameSec.

[42]  Sailik Sengupta,et al.  Imperfect ImaGANation: Implications of GANs Exacerbating Biases on Facial Data Augmentation and Snapchat Selfie Lenses , 2020, ArXiv.