Modeling Probabilistic Measurement Correlations for Problem Determination in Large-Scale Distributed Systems

As computer systems grow more complex, detecting and diagnosing problems in today's large-scale distributed systems has become a real challenge. The correlations between measurements collected across a distributed system usually contain rich information about system behavior, so a reasonable model of these correlations is crucial for detecting and locating system problems. In this paper, we propose a transition probability model based on Markov properties to characterize pairwise measurement correlations. The proposed method discovers both spatial correlations (across system measurements) and temporal correlations (across observation time), so the model can represent the system's normal profile. The framework is general enough to discover any type of correlation (e.g., linear or non-linear), and model updating, problem detection, and problem localization can all be carried out efficiently within it. Experimental results on real monitoring data collected from three companies' infrastructures show that the proposed method can detect anomalous events and locate their problematic sources.
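To make the idea of a pairwise transition probability model concrete, the following is a minimal sketch and not the paper's actual algorithm. It assumes an equal-width grid over the two measurements, a first-order Markov chain over the resulting grid cells, and additive smoothing; the function names and the `bins` and `alpha` parameters are hypothetical choices made for illustration.

```python
import numpy as np


def to_states(x, y, x_edges, y_edges):
    """Map each (x, y) observation to the index of its grid cell."""
    bins = len(x_edges) - 1
    # Digitizing against the interior edges yields cell indices 0..bins-1,
    # so values outside the training range fall into the border cells.
    xi = np.digitize(x, x_edges[1:-1])
    yi = np.digitize(y, y_edges[1:-1])
    return xi * bins + yi


def fit_transition_model(x, y, bins=4, alpha=0.1):
    """Estimate cell-to-cell transition probabilities from normal data."""
    x_edges = np.linspace(x.min(), x.max(), bins + 1)
    y_edges = np.linspace(y.min(), y.max(), bins + 1)
    states = to_states(x, y, x_edges, y_edges)
    n_states = bins * bins
    # Additive smoothing (an assumption) keeps unseen transitions non-zero.
    counts = np.full((n_states, n_states), alpha, dtype=float)
    np.add.at(counts, (states[:-1], states[1:]), 1.0)
    probs = counts / counts.sum(axis=1, keepdims=True)
    return probs, x_edges, y_edges


def transition_scores(probs, x, y, x_edges, y_edges):
    """Return the model probability of each observed transition."""
    states = to_states(x, y, x_edges, y_edges)
    return probs[states[:-1], states[1:]]


# Synthetic pair with a non-linear relationship (y = x^2), used as the
# "normal" profile. The data are illustrative, not from the paper.
t = np.linspace(0.0, 20.0, 500)
x_train = np.sin(t)
y_train = x_train ** 2
probs, x_edges, y_edges = fit_transition_model(x_train, y_train)

# Normal behavior follows the learned relationship; the broken pair sits
# in a grid cell that the training data never visited.
normal_scores = transition_scores(probs, x_train, y_train, x_edges, y_edges)
broken_scores = transition_scores(
    probs, np.full(50, 0.2), np.full(50, 0.9), x_edges, y_edges
)
```

Low transition probabilities flag observations that break the learned correlation. In a system with many measurements, one such model per measurement pair could be scored in the same way, and the measurements involved in the most low-scoring pairs would be candidates for the problem source; the aggregation rule is left open here because the abstract does not specify it.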
