PeerWatch: a fault detection and diagnosis tool for virtualized consolidation systems

Server virtualization is becoming an effective means of consolidating numerous applications onto a small number of machines. While such a strategy can lead to significant savings in power and hardware cost, it may complicate fault management because of the increasing scale and complexity of the virtualized environment. In this paper, we propose PeerWatch, a fault detection and diagnosis tool designed specifically for virtualized consolidation systems. Based on the observation that each application usually runs as multiple instances in a virtualized data center, PeerWatch uses a statistical technique, canonical correlation analysis (CCA), to extract the correlated characteristics between multiple application instances. The extracted correlations are then used to examine the status of each application instance. If some correlations drop significantly during operation, PeerWatch considers the system to be in a faulty state and raises an alarm. Compared to traditional fault detection techniques, PeerWatch is robust to system dynamics and can therefore avoid many false alarms. Once a fault has been detected, PeerWatch runs a diagnosis process that also takes advantage of the multiple instances present in virtualized systems. The diagnosis combines spatial and temporal analysis of the measurement data across multiple instances before and after the failure. As a result, PeerWatch can obtain much more accurate clues about the root cause of the fault. Experimental results on our virtualized testbed system demonstrate the effectiveness of the proposed detection and diagnosis tool.
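The detection idea described above can be illustrated with a minimal sketch. It is not the PeerWatch implementation: the function names, the synthetic metrics, and the drop threshold of 0.3 are all assumptions made for illustration. The canonical correlations are computed with the standard QR-plus-SVD method, and an alarm is signalled when the leading correlation between two peer instances falls well below its reference value.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two metric matrices (rows = time samples).

    Uses the standard QR + SVD approach: the singular values of Qx^T Qy are the
    canonical correlations when X and Y have full column rank.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

def correlation_dropped(X_ref, Y_ref, X_new, Y_new, drop=0.3):
    """Hypothetical alarm rule: flag a fault if the leading canonical
    correlation falls by more than `drop` (an assumed threshold)."""
    ref = canonical_correlations(X_ref, Y_ref)[0]
    new = canonical_correlations(X_new, Y_new)[0]
    return bool(ref - new > drop)

# Synthetic data (assumption): two instances of one application whose
# metrics follow a shared workload signal plus small measurement noise.
rng = np.random.default_rng(0)
mix_a = np.array([[1.0, 0.5, 2.0]])   # how instance A's metrics follow load
mix_b = np.array([[0.8, 1.5, 0.3]])   # how instance B's metrics follow load

def window(n, mix):
    load = rng.normal(size=(n, 1))
    return load, load @ mix + 0.1 * rng.normal(size=(n, mix.shape[1]))

load_ref, X_ref = window(500, mix_a)
Y_ref = load_ref @ mix_b + 0.1 * rng.normal(size=(500, 3))

# Healthy window: both instances still follow the same workload.
load_ok, X_ok = window(500, mix_a)
Y_ok = load_ok @ mix_b + 0.1 * rng.normal(size=(500, 3))

# Faulty window: instance B's metrics no longer track the workload.
_, X_bad = window(500, mix_a)
Y_bad = rng.normal(size=(500, 3))

print(correlation_dropped(X_ref, Y_ref, X_ok, Y_ok))
print(correlation_dropped(X_ref, Y_ref, X_bad, Y_bad))
```

Because both instances respond to the same workload, their correlation stays high even when the workload itself changes, which is why a correlation-based check is less sensitive to system dynamics than fixed thresholds on individual metrics.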
