A New Labeled Flow-based DNS Dataset for Anomaly Detection: PUF Dataset

Abstract Flow-based anomaly detection is gaining momentum because it analyses only packet headers and can therefore be deployed for real-time detection. Evaluating anomaly detection techniques requires a labeled dataset; an unlabeled dataset is not useful for this purpose. Many packet-based network traffic datasets are available, but flow-based datasets are scarce. In this paper, we present a comprehensive review of the existing flow-based datasets, with emphasis on their main shortcomings. We then present a new labeled flow-based DNS dataset, the PUF Dataset, for detecting compromised hosts in a network. The dataset consists of real flows captured at the Computer Centre of Panjab University, which handles the entire campus network. All flows are labeled using logs generated by signatures for DNS anomalies implemented in an Intrusion Prevention System. The final dataset consists of 298,463 flows, of which 260,343 are benign and 38,120 are anomalous. Profiles have been generated for sub-networks, for benign as well as anomalous flows, using both statistically derived and entropy-based features. These profiles can be used to detect anomalous sub-networks, thereby helping to isolate compromised host(s).
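To illustrate what an entropy-based sub-network profile might look like, the following is a minimal Python sketch. It is not the procedure used to build the PUF Dataset: the flow field names (`src_ip`, `dst_ip`), the grouping of hosts by /24 prefix, and the choice of destination address as the profiled feature are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict
from ipaddress import ip_network


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def subnet_of(ip, prefix_len=24):
    """Map a host address to its enclosing sub-network (assumed /24 here)."""
    return str(ip_network(f"{ip}/{prefix_len}", strict=False))


def subnet_entropy_profiles(flows, feature="dst_ip", prefix_len=24):
    """Group flows by source sub-network and compute the entropy of one feature.

    `flows` is an iterable of dicts; the keys `src_ip` and `feature` are
    hypothetical field names, not the dataset's actual schema.
    """
    grouped = defaultdict(list)
    for flow in flows:
        grouped[subnet_of(flow["src_ip"], prefix_len)].append(flow[feature])
    return {subnet: shannon_entropy(vals) for subnet, vals in grouped.items()}
```

Under these assumptions, a sub-network whose flows all target a single resolver yields an entropy of zero, while one that spreads its flows evenly over many destinations yields a high value. A profile built from benign flows could then serve as a baseline against which such deviations are compared.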