Detection and Diagnosis of Recurrent Faults in Software Systems by Invariant Analysis

A correctly functioning enterprise-software system exhibits long-term, stable correlations between many of its monitoring metrics. Some of these correlations no longer hold when there is an error in the system, which potentially enables error detection and fault diagnosis. However, existing approaches are inefficient: they require a large number of metrics to be monitored, and they ignore the relative discriminative power of different metric correlations. In enterprise-software systems, similar faults tend to recur. It is therefore possible to significantly improve existing correlation-analysis approaches by learning the effects of common recurrent faults on correlations. We present methods to determine the most significant correlations to track for efficient error detection, and the correlations that contribute the most to diagnosis accuracy. We apply machine learning to identify the relevant correlations, removing the need for the manually configured correlation thresholds used in prior approaches. We validate our work on a multi-tier enterprise-software system. We are able to detect and correctly diagnose 8 of 10 injected faults to within three possible causes, and to within two possible causes in 7 of the 8 cases. This compares favourably with existing approaches, whose diagnosis accuracy is 3 of 10 to within three possible causes. We achieve a precision of at least 95%.
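The general idea of tracking metric correlations and matching their violations against recurrent faults can be illustrated with a minimal sketch. It is not the method of this paper: it assumes pairwise linear regression as the correlation model, uses fixed thresholds (`min_r2`, `slack`) where the paper learns the relevant correlations instead, and ranks faults by Jaccard similarity to stored signatures. All function names, parameters, and metric names are hypothetical.

```python
from itertools import combinations
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of ys ~ a * xs + b; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx else 0.0
    return a, my - a * mx


def learn_invariants(healthy, min_r2=0.9):
    """Keep metric pairs whose linear fit on healthy data has R^2 >= min_r2.

    healthy maps a metric name to its list of samples (equal lengths).
    Each kept pair stores (slope, intercept, largest healthy residual).
    """
    invariants = {}
    for m1, m2 in combinations(sorted(healthy), 2):
        xs, ys = healthy[m1], healthy[m2]
        a, b = fit_line(xs, ys)
        residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
        my = mean(ys)
        ss_tot = sum((y - my) ** 2 for y in ys)
        ss_res = sum(r * r for r in residuals)
        r2 = 1.0 - ss_res / ss_tot if ss_tot else 0.0
        if r2 >= min_r2:
            invariants[(m1, m2)] = (a, b, max(abs(r) for r in residuals))
    return invariants


def broken_invariants(invariants, sample, slack=3.0):
    """Return the pairs whose residual on one sample exceeds slack * tolerance."""
    broken = set()
    for (m1, m2), (a, b, tol) in invariants.items():
        residual = sample[m2] - (a * sample[m1] + b)
        if abs(residual) > slack * max(tol, 1e-9):
            broken.add((m1, m2))
    return broken


def rank_faults(broken, signatures):
    """Order known fault names by Jaccard similarity to the broken set."""
    def jaccard(sig):
        union = broken | sig
        return len(broken & sig) / len(union) if union else 1.0

    return sorted(signatures, key=lambda name: jaccard(signatures[name]),
                  reverse=True)
```

In this sketch, an empty result from `broken_invariants` means no error is detected; a non-empty result is passed to `rank_faults` together with signatures recorded from earlier occurrences of each fault, and the head of the returned list gives the most likely causes.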
