An Autonomous Labeling Pipeline for Intrusion Detection on Enterprise Networks

The volume of cyberattacks has grown exponentially over the last half-decade and shows no sign of slowing. Attacks are also evolving rapidly and becoming increasingly sophisticated. Cybersecurity companies and academics alike have turned to machine learning to build models that learn data-driven rules for threat detection. However, these methods require a substantial amount of labeled training data, and many enterprises lack the infrastructure to label their own network traffic for supervised learning. The labeling problem is further complicated by the frequent reassignment of IP addresses to new hosts. In this paper, we lay a foundation for an autonomous traffic labeling pipeline that incorporates three different sources of ground truth and requires minimal manual intervention. We apply the labeling pipeline to network traffic data acquired from the University of Virginia. We process the traffic with Zeek, a popular network monitoring framework that provides aggregated statistics about the packets exchanged between a source and a destination over a given time interval. The pipeline then labels this traffic using three sources of ground truth: data from a network of honeypots compiled by the Duke STINGAR project, a series of nine blacklists, and a whitelist called Cisco Umbrella. We show, using cluster, port, and IP-location analyses, that a labeling methodology that ensembles the different data sources outperforms one that relies on any single source. The labeling methodology proposed in this paper will aid enterprise network administrators in building robust intrusion detection systems.
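
To make the idea of ensembling the three ground-truth sources concrete, the following is a minimal sketch and not the labeling rule used in the paper, which the abstract does not specify. The `Flow` fields mirror the `ts`, `id.orig_h`, and `id.resp_h` columns of Zeek's conn.log. Everything else is an assumption made for illustration: the function name `label_flow`, the precedence order of the sources, the one-day `window` used to guard against IP reassignment, the `min_votes` blacklist threshold, and the treatment of the whitelist as a set of already-resolved IP addresses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    ts: float      # flow start time in epoch seconds (Zeek conn.log `ts`)
    orig_h: str    # source IP (Zeek conn.log `id.orig_h`)
    resp_h: str    # destination IP (Zeek conn.log `id.resp_h`)


def label_flow(flow, honeypot_hits, blacklists, whitelist,
               window=86400.0, min_votes=2):
    """Hypothetical ensemble rule: honeypots, then blacklists, then whitelist.

    honeypot_hits: dict mapping an IP to timestamps at which it hit a honeypot
    blacklists:    list of sets of blacklisted IPs, one set per list
    whitelist:     set of whitelisted IPs (assumed already resolved)
    """
    # Honeypot evidence counts only if it is close in time to the flow,
    # because the IP may since have been reassigned to a different host.
    hits = honeypot_hits.get(flow.orig_h, ())
    if any(abs(flow.ts - t) <= window for t in hits):
        return "malicious"

    # Require agreement between several blacklists before trusting them.
    votes = sum(flow.orig_h in bl for bl in blacklists)
    if votes >= min_votes:
        return "malicious"

    # A whitelisted destination with no blacklist evidence is labeled benign.
    if flow.resp_h in whitelist and votes == 0:
        return "benign"

    # Conflicting or missing evidence is left unlabeled.
    return "unknown"


# Illustrative inputs using documentation-reserved IP ranges.
honeypot_hits = {"203.0.113.7": [1_000_000.0]}
blacklists = [{"198.51.100.9"}, {"198.51.100.9", "192.0.2.55"}]
whitelist = {"192.0.2.10"}

flows = [
    Flow(1_000_500.0, "203.0.113.7", "192.0.2.10"),  # recent honeypot hit
    Flow(2_000_000.0, "203.0.113.7", "192.0.2.10"),  # honeypot hit too old
    Flow(0.0, "198.51.100.9", "192.0.2.10"),         # on two blacklists
    Flow(0.0, "192.0.2.55", "192.0.2.10"),           # on one blacklist only
]
labels = [label_flow(f, honeypot_hits, blacklists, whitelist) for f in flows]
```

Under these assumptions, a flow whose source appears on only one blacklist stays `"unknown"` rather than being forced into a class, which is one way an ensemble can avoid inheriting the noise of a single source.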
