Determining the Minimum Sample Size of Audit Data Required to Profile User Behavior and Detect Anomaly Intrusion

Although statistical modeling techniques have been employed to detect anomaly intrusion and profile user behavior with network traffic data collected from multiple sites (IP addresses), the minimum sample size of audit data required for each site remains unclear. Using the off-line Intrusion Detection Evaluation data developed by the Lincoln Laboratory at the Massachusetts Institute of Technology under Defense Advanced Research Projects Agency sponsorship, this study addresses the problem of determining that sample size. Bivariate analysis was employed to construct a composite score that ranks each site by its probability of being anomalous, and statistical simulations were conducted to evaluate how the ranking varies between the population-based “true” pattern of user behavior and different sample-based “observed” patterns. A sequence of hierarchical random effects logistic regression models was then fitted to compare the performance of classifications based on the full dataset with those based on samples. The results show that a minimum sample size of 500 per site provides a sensitivity of 0.85, a specificity of 0.92, and a kappa statistic of 0.77. Compared with the full dataset-based model, the minimum sample-based model had a similar area under the Receiver Operating Characteristic curve (0.983 vs. 0.997) and a slightly higher misclassification rate (3.16% vs. 1.71%) in detecting abnormal patterns.
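
For readers less familiar with the agreement measures reported above, the following minimal Python sketch shows how sensitivity, specificity, and Cohen's kappa are computed from a 2x2 table that cross-classifies sites as anomalous or normal under a population-based and a sample-based classification. The function name `agreement_metrics` and the cell counts are hypothetical and are not taken from the study's data. The counts were chosen only so that sensitivity and specificity equal the reported 0.85 and 0.92 under the added assumption of equally sized classes, in which case kappa works out to 0.77.

```python
def agreement_metrics(tp: int, fp: int, fn: int, tn: int) -> tuple[float, float, float]:
    """Return (sensitivity, specificity, Cohen's kappa) for a 2x2 table.

    tp, fn: sites that are truly anomalous, flagged / not flagged by the sample.
    fp, tn: sites that are truly normal, flagged / not flagged by the sample.
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)

    # Observed agreement: proportion of sites classified identically.
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of the table.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (observed - expected) / (1 - expected)
    return sensitivity, specificity, kappa


if __name__ == "__main__":
    # Hypothetical counts (illustration only): 100 anomalous and 100 normal sites.
    sens, spec, kappa = agreement_metrics(tp=85, fp=8, fn=15, tn=92)
    print(f"sensitivity={sens:.2f} specificity={spec:.2f} kappa={kappa:.2f}")
```

Kappa depends on the class balance as well as on sensitivity and specificity, so the same two rates would give a different kappa if anomalous sites were much rarer than normal ones.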