Identifying Recurrent and Unknown Performance Issues

When a performance issue occurs in a large-scale software system, especially an online service system, it is desirable to check whether the same issue has occurred before. If a similar issue exists in the past record, a known remedy can be applied; otherwise, a new troubleshooting process may have to be initiated. The symptom of a performance issue can be characterized by a set of metrics. Because software systems are complex, manual diagnosis of performance issues from metric data is typically expensive and laborious. In this paper, we propose a Hidden Markov Random Field (HMRF) based approach to the automatic identification of recurrent and unknown performance issues. We formulate issue identification as an HMRF-based clustering problem. Our approach jointly learns the metric discretization thresholds and optimizes the issue clustering. Using the learned thresholds and cluster centroids, we can accurately identify both recurrent issues and unknown issues. Experimental evaluations on an open benchmark and a large-scale industrial production system show that our approach is effective and outperforms related state-of-the-art approaches.
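The final identification step described above (discretize the metrics with learned thresholds, then compare against learned cluster centroids) can be illustrated with a minimal sketch. This is not the HMRF-based method itself: it assumes the thresholds and centroids have already been learned, uses binary discretization, and substitutes a plain nearest-centroid rule with Hamming distance. The names `discretize`, `hamming`, `identify`, and the cutoff `max_distance` are hypothetical and chosen only for illustration.

```python
from typing import Optional, Sequence, Tuple


def discretize(metrics: Sequence[float], thresholds: Sequence[float]) -> Tuple[int, ...]:
    """Map each raw metric to 1 if it exceeds its threshold, else 0.

    Assumption: one threshold per metric and a binary state per metric.
    """
    return tuple(1 if m > t else 0 for m, t in zip(metrics, thresholds))


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions where two discretized states differ."""
    return sum(x != y for x, y in zip(a, b))


def identify(
    metrics: Sequence[float],
    thresholds: Sequence[float],
    centroids: Sequence[Sequence[int]],
    max_distance: int,
) -> Optional[int]:
    """Return the index of the closest known issue cluster, or None.

    None means no centroid lies within max_distance (a hypothetical cutoff),
    so the issue is treated as unknown.
    """
    state = discretize(metrics, thresholds)
    best_idx: Optional[int] = None
    best_dist: Optional[int] = None
    for idx, centroid in enumerate(centroids):
        dist = hamming(state, centroid)
        if best_dist is None or dist < best_dist:
            best_idx, best_dist = idx, dist
    if best_dist is not None and best_dist <= max_distance:
        return best_idx  # recurrent issue: matches a known cluster
    return None  # unknown issue: start a new troubleshooting process


# Made-up example values: three metrics and two known issue clusters.
thresholds = [0.8, 100.0, 0.5]
centroids = [(1, 0, 0), (0, 1, 1)]
recurrent = identify([0.9, 50.0, 0.1], thresholds, centroids, max_distance=1)
unknown = identify([0.9, 150.0, 0.1], thresholds, centroids, max_distance=0)
```

In this sketch, a returned index marks the issue as recurrent, so the remedy recorded for that cluster could be looked up, while `None` marks it as unknown. In the approach the abstract describes, the thresholds and centroids come from the HMRF-based clustering rather than being fixed by hand as they are here.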
