Log-Based Anomaly Detection for System Surveillance

As log files grow, manually detecting errors in them becomes increasingly difficult, which creates a need for automated anomaly detection tools that do not require human assistance. This thesis aims to develop a prototype of such a tool that monitors system state based on the log files the system produces. A specific and a generic approach to analyzing the data are explored to form a foundation for design decisions. Insights from both approaches are then used to build the prototype in three stages: a basic prototype, an extension of the prototype, and an evaluation. The prototype is evaluated through a number of interviews and by measuring its accuracy and performance. The resulting prototype graphs total lines, words, and bigrams per hour, and visualizes the words, bigrams, and anomalous messages that occur in each log file. A user-specified blacklist highlights undesired words in any file. Anomalies are detected by comparing current values with historical values while taking overall trends into account. Two professionals whose work involves log handling found the prototype useful and considered the interface functional. The prototype handles most data correctly but suffers from false alarms, and it found 11 of 14 known errors. It handles a shift in normality well, adapting within a week. In conclusion, the developed prototype is usable, mainly for large log files, but it requires more accurate anomaly detection, and the interface can be improved further.
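
To illustrate the idea of counting lines per hour and comparing a current value against historical ones, here is a minimal Python sketch. It is not the thesis's implementation: the timestamp format (each line starting with `YYYY-MM-DD HH:MM:SS`), the function names `lines_per_hour` and `is_anomalous`, and the three-standard-deviation threshold are all assumptions made for illustration, and the sketch does not model the trend adjustment that the prototype is described as using.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, stdev


def lines_per_hour(log_lines):
    """Count log lines per hour.

    Assumption: every line starts with a 19-character timestamp
    of the form 'YYYY-MM-DD HH:MM:SS'.
    """
    counts = Counter()
    for line in log_lines:
        stamp = datetime.fromisoformat(line[:19])
        # Truncate the timestamp to the start of its hour.
        counts[stamp.replace(minute=0, second=0)] += 1
    return counts


def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it deviates strongly from the historical values.

    Assumption: a simple z-score test stands in for the comparison
    of historical and current values; the threshold is hypothetical.
    """
    if len(history) < 2:
        return False  # Too little history to judge.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > threshold
```

In this sketch, `history` might hold the line counts observed for the same hour of the day on previous days, so that a sudden burst or drop in log volume stands out against what is normal for that hour.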
