Empirical Comparison of Techniques for Automated Failure Diagnosis

Automated diagnosis of the causes of system failures from monitoring data is an active area of research at the intersection of systems and machine learning. In this paper, we identify three tasks that form key building blocks in automated diagnosis: (1) identifying distinct states of the system using monitoring data; (2) retrieving monitoring data from past system states that are similar to the current state; and (3) pinpointing attributes in the monitoring data that indicate the likely cause of a system failure. We provide, to our knowledge, the first apples-to-apples comparison of both classical and state-of-the-art techniques for these three tasks. Such studies are vital to the consolidation and growth of the field. Our study is based on a variety of failures injected into a multitier Web service. We present empirical insights and identify research opportunities.
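To make the three tasks concrete, the following is a minimal sketch of what each one computes on a toy data set. It is not any of the techniques compared in the paper: the metric names, the sample values, the `FULL_SCALE` normalization constants, the `radius` threshold, and the simple stand-in methods (leader clustering, nearest-neighbor lookup, difference of means) are all assumptions chosen purely for illustration.

```python
import math
from statistics import mean

# Hypothetical monitoring samples, one per time window (illustrative values only).
ATTRS = ["cpu", "latency_ms", "db_conns"]
SAMPLES = [
    {"cpu": 0.20, "latency_ms": 40.0, "db_conns": 10.0},
    {"cpu": 0.25, "latency_ms": 45.0, "db_conns": 12.0},
    {"cpu": 0.22, "latency_ms": 42.0, "db_conns": 11.0},
    {"cpu": 0.90, "latency_ms": 400.0, "db_conns": 11.0},
    {"cpu": 0.95, "latency_ms": 420.0, "db_conns": 12.0},
]
# Assumed nominal full-scale value per metric, used to put metrics on one scale.
FULL_SCALE = {"cpu": 1.0, "latency_ms": 500.0, "db_conns": 100.0}


def normalize(sample):
    """Convert a sample dict into a vector of unitless values."""
    return [sample[a] / FULL_SCALE[a] for a in ATTRS]


def identify_states(vectors, radius=0.3):
    """Task 1 stand-in: leader clustering. A sample joins the first state whose
    leader lies within `radius`; otherwise it starts a new state."""
    leaders, labels = [], []
    for v in vectors:
        for i, leader in enumerate(leaders):
            if math.dist(v, leader) <= radius:
                labels.append(i)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels


def retrieve_similar(vectors, query, k=2):
    """Task 2 stand-in: indices of the k past samples nearest to the query."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(vectors[i], query))
    return order[:k]


def pinpoint_attributes(vectors, labels, failing_state):
    """Task 3 stand-in: rank attributes by how far their mean in the failing
    state departs from their mean in all other samples."""
    scores = {}
    for j, attr in enumerate(ATTRS):
        bad = [v[j] for v, s in zip(vectors, labels) if s == failing_state]
        ok = [v[j] for v, s in zip(vectors, labels) if s != failing_state]
        scores[attr] = abs(mean(bad) - mean(ok))
    return sorted(ATTRS, key=lambda a: scores[a], reverse=True)


vectors = [normalize(s) for s in SAMPLES]
states = identify_states(vectors)
current = normalize({"cpu": 0.92, "latency_ms": 410.0, "db_conns": 11.0})
similar = retrieve_similar(vectors, current)
ranking = pinpoint_attributes(vectors, states, failing_state=1)
print(states, similar, ranking)
```

On this toy input the first three windows fall into one state and the last two into another, the two nearest past samples to `current` are the last two windows, and `db_conns` ranks last because it barely differs between the two states. Real monitoring data is far noisier and higher-dimensional, which is why the choice of technique for each task matters.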
