A methodology for root-cause analysis in component-based systems

In component-based enterprise systems, anomaly detectors are commonly deployed on application-level components but not on lower-level functional components. When anomaly alarms are triggered, system managers are expected to handle them in a timely manner to avoid cascading failures. An excessively large volume of anomaly alarms, however, makes manual handling impractical. Most existing root cause analysis methods assume that all components are monitored, and perform the analysis based on the time correlation of the generated alarms. However, full monitoring coverage may not be practical due to cost and complexity. In this paper, we present RCSF, a root cause analysis method that targets systems where only application-level components are monitored by anomaly detectors. The method analyzes the performance logs of functional components and seeks the most probable fault propagation sequences based on anomaly analysis. We evaluate RCSF on real enterprise system data and compare it with several baseline methods. Experimental results show that the proposed method can effectively pinpoint the root causes of failures by providing a short list of the most probable causes, and that it significantly outperforms the baseline methods.
