Incremental Clustering for Semi-Supervised Anomaly Detection applied on Log Data

Anomaly detection based on white-listing and self-learning has proven to be a promising approach to detect customized and advanced cyber attacks. Anomaly detection aims at detecting significant deviations from normal system and network behavior. A well-known method to classify anomalous and normal system behavior is clustering of log lines. However, this approach has been applied for forensic purposes only, where log data dumps are investigated retrospectively. In order to make this concept applicable for on-line anomaly detection, i.e., at the time the log lines are produced, some major extensions to existing approaches are required. Especially distance based clustering approaches usually fail building the required large distance matrices and rely on time-consuming recalculations of the cluster-map on every arriving log line. An incremental clustering approach seems suitable to solve this issues. Thus, we introduce a semi-supervised concept for incremental clustering of log data that builds the basis for a novel on-line anomaly detection solution based on log data streams. Its operation is independent from the syntax and semantics of the processed log lines, which makes it generally applicable. We demonstrate that that the introduced anomaly detection approach allows to achieve both a high recall and a high precision while maintaining linear complexity.