Ranking the importance of alerts for problem determination in large computer systems

The complexity of large computer systems raises unprecedented challenges for system management. In practice, operators often collect large volumes of monitoring data from system components and set up many rules that check the data and trigger alerts. However, alerts from different rules usually differ in how accurately they report problems, because their thresholds are often set manually based on operators’ experience and intuition. Meanwhile, because of system dependencies, a single problem may trigger many alerts at the same time in a large system, and the critical question is which alert should be analyzed first in the subsequent problem determination process. In this paper, we propose a novel peer review mechanism to rank the importance of alerts, so that the top-ranked alerts are more likely to be true positives. After comparing a metric value against its threshold to generate an alert, we also compare the value with the equivalent thresholds from many other rules to determine the alert's importance. We evaluate our approach on a real test bed system and include experimental results that demonstrate its effectiveness.
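
The peer review idea can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not say how equivalent thresholds are derived or how the comparisons are combined, so this sketch assumes the equivalent thresholds are already given as plain numbers and scores each alert by the fraction of peer thresholds its value also exceeds. The names `Alert`, `peer_score`, and `rank_alerts`, the rule names, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    rule: str                     # hypothetical rule name
    value: float                  # observed metric value
    threshold: float              # the rule's own, manually set threshold
    peer_thresholds: List[float]  # ASSUMPTION: equivalent thresholds from
                                  # other rules, already mapped onto this metric


def peer_score(alert: Alert) -> float:
    """Fraction of peer thresholds that the observed value also exceeds.

    ASSUMPTION: a simple vote count stands in for whatever scoring the
    paper actually uses.
    """
    if not alert.peer_thresholds:
        return 0.0
    exceeded = sum(1 for t in alert.peer_thresholds if alert.value > t)
    return exceeded / len(alert.peer_thresholds)


def rank_alerts(alerts: List[Alert]) -> List[Alert]:
    """Keep only rules whose own threshold is violated, then sort by peer score."""
    fired = [a for a in alerts if a.value > a.threshold]
    return sorted(fired, key=peer_score, reverse=True)


# Made-up example data.
alerts = [
    Alert("cpu_high", 92.0, 80.0, [85.0, 90.0, 95.0, 88.0]),
    Alert("mem_high", 71.0, 70.0, [80.0, 85.0, 75.0]),
    Alert("disk_io", 40.0, 60.0, [35.0, 38.0]),  # does not fire
]
ranked = rank_alerts(alerts)
for a in ranked:
    print(a.rule, round(peer_score(a), 2))
```

In this toy data, `cpu_high` fires and also exceeds three of its four peer thresholds, so it ranks above `mem_high`, which barely crosses its own threshold and exceeds none of its peers. `disk_io` never violates its own threshold and is therefore not ranked at all.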
