ADELE: Anomaly Detection from Event Log Empiricism

A sudden slowdown or shutdown of an enterprise application affects a large population of users. System administrators and analysts spend a considerable amount of time dealing with functional and performance bugs. These problems are particularly hard to detect and diagnose in most computer systems, since a huge amount of system-generated supportability data (counters, logs, etc.) needs to be analyzed. Most often, there is no clear or obvious root cause. Timely identification of significant changes in application behavior is essential to prevent negative impact on the service. In this paper, we present ADELE, an empirical, data-driven methodology for early detection of anomalies in data storage systems. The key features of our solution are the diligent selection of features from system logs and the development of effective machine learning techniques for anomaly prediction. ADELE learns from a system's own history to establish a baseline of normal behavior and gives accurate indications of the time period when something is amiss for a system. Validation on more than 4800 actual support cases shows a true positive rate of ~83% and a false positive rate of ~12% in identifying periods when the machine is not performing normally. We also establish the existence of problem “signatures” that help map customer problems to issues already seen in the field. ADELE's capability to predict early paves the way for online failure prediction for customer systems.
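
The idea of learning a baseline from a system's own history and flagging periods that deviate from it can be illustrated with a minimal sketch. This is not ADELE's actual model: the abstract does not specify the features or the learning technique, so the per-feature mean and standard deviation, the z-score rule, the feature names (`disk_error`, `timeout`), and the threshold of 3.0 are all assumptions chosen only for illustration.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Summarize each log feature's past per-interval counts as (mean, stdev).

    `history` maps a hypothetical feature name to a list of counts,
    one per past time interval (at least two intervals per feature).
    """
    return {name: (mean(counts), stdev(counts)) for name, counts in history.items()}

def anomaly_score(baseline, observation):
    """Return the largest absolute z-score across all baseline features."""
    score = 0.0
    for name, (mu, sigma) in baseline.items():
        x = observation.get(name, 0)
        if sigma > 0:
            z = abs(x - mu) / sigma
        else:
            # A feature that never varied: any change at all is maximally unusual.
            z = 0.0 if x == mu else float("inf")
        score = max(score, z)
    return score

def flag_periods(baseline, observations, threshold=3.0):
    """Return indices of intervals whose score exceeds the (assumed) threshold."""
    return [i for i, obs in enumerate(observations)
            if anomaly_score(baseline, obs) > threshold]

# Hypothetical history: counts of two log event types over six past intervals.
history = {"disk_error": [2, 3, 2, 3, 2, 3], "timeout": [0, 1, 0, 1, 0, 1]}
baseline = fit_baseline(history)

# Two new intervals: one typical, one with a burst of disk errors.
new_intervals = [{"disk_error": 3, "timeout": 1}, {"disk_error": 40, "timeout": 0}]
flagged = flag_periods(baseline, new_intervals)
```

Here only the second interval is flagged, because its disk error count lies far outside the range seen in the system's own history. A real deployment would need a principled choice of features and threshold, which this sketch does not attempt.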
