Contextual Anomaly Detection for a Critical Industrial System Based on Logs and Metrics

Recent advances in contextual anomaly detection attempt to combine resource metrics and event logs to uncover unexpected system behaviors at run-time. This is highly relevant for critical software systems, where monitoring is often mandated by international standards and guidelines. In this paper, we analyze the effectiveness of a metrics-logs contextual anomaly detection technique in a middleware for Air Traffic Control systems. Our study addresses the challenges of applying such techniques to a new case study with a dense volume of logs and a finer monitoring sampling rate. Guided by our experimental results, we propose and evaluate several actionable improvements, which include a change detection algorithm and the use of time windows in contextual anomaly detection.
