Exploratory Data Analysis of a Unified Host and Network Dataset

Exploratory data analysis is invaluable for understanding data, choosing appropriate models, and interpreting, validating, and applying results. It often reveals patterns that help answer research questions. In this paper, we perform exploratory data analysis on cybersecurity data in the NetFlow Dataset from “The Unified Host and Network Dataset”, a large, open-source dataset collected on the Los Alamos National Laboratory (LANL) enterprise network and published to encourage new research in cybersecurity. The NetFlow Dataset is a compilation of flow logs from routers within the LANL network, aggregated into a relational format using network stitching. Our exploratory data analysis shows distinct patterns and clusters within a single day of data. Specifically, scatter plots of the number of packets sent by the destination device versus the number of packets sent by the source device show three distinct, no-intercept linear relationships between the two variables. These relationships suggest three common patterns in how source and destination devices exchange packets. Our analysis also shows that the byte and packet distributions of connections on rare ports differ statistically from those of connections on common ports, suggesting that these differences can be used to discriminate between normal and abnormal network behavior. Our findings may be useful for research into classification problems with the Unified Host and Network Dataset and for furthering cluster analysis in cybersecurity research.
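
As an illustration of the two kinds of analysis summarized above, the following minimal Python sketch shows how a no-intercept linear relationship between source and destination packet counts could be estimated, and how two distributions (for example, packet counts on rare versus common ports) could be compared. The function names `slope_through_origin` and `ks_statistic` are hypothetical, and the two-sample Kolmogorov-Smirnov statistic is an assumed choice of comparison; the source does not state which statistical test was used.

```python
def slope_through_origin(src_packets, dst_packets):
    """Least-squares slope b for the no-intercept model dst = b * src.

    Minimizing sum((dst - b * src)^2) over b gives b = sum(x*y) / sum(x*x).
    """
    sxy = sum(x * y for x, y in zip(src_packets, dst_packets))
    sxx = sum(x * x for x in src_packets)
    if sxx == 0:
        raise ValueError("all source packet counts are zero")
    return sxy / sxx


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs
    (0.0 for identical samples, 1.0 for fully separated samples).
    Assumed here as one possible comparison, not the authors' stated test.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        v = min(a[i], b[j])
        # Advance both pointers past every value <= v, so ties are handled.
        while i < len(a) and a[i] <= v:
            i += 1
        while j < len(b) and b[j] <= v:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


if __name__ == "__main__":
    # Hypothetical packet counts, for illustration only (not from the dataset).
    src = [10, 20, 30, 40]
    dst = [20, 40, 60, 80]
    print("estimated slope:", slope_through_origin(src, dst))

    rare_ports = [1, 2, 2, 3]
    common_ports = [50, 60, 70, 80]
    print("KS statistic:", ks_statistic(rare_ports, common_ports))
```

In practice, each of the three linear patterns would need to be isolated first (for example, by grouping connections by their destination-to-source packet ratio) before fitting a slope to each group; that grouping step is omitted here.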