Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems

In modern cloud computing systems, hundreds or even thousands of servers are interconnected by multi-layer networks. In such large-scale and complex systems, failures are common. Proactive failure management is a crucial technology for characterizing system behavior and forecasting failure dynamics in the cloud. To make failure predictions, we need to monitor system execution and collect health-related runtime performance data. However, in newly deployed or newly managed cloud systems, these data are usually unlabeled, so approaches based on supervised learning are not suitable. In this paper, we present an unsupervised failure detection method that uses an ensemble of Bayesian models. It characterizes the normal execution states of the system and detects anomalous behaviors. Once system administrators have verified the detected anomalies, labeled data become available. We then apply supervised learning based on decision tree classifiers to predict future failure occurrences in the cloud. Experimental results from an institute-wide cloud computing system show that our methods achieve a high true positive rate and a low false positive rate for proactive failure management.
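The two-stage workflow described above can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the form of the Bayesian models or of the trees, so this sketch assumes independent per-metric Gaussian models trained on different subsets of the data as the ensemble, and a depth-1 decision tree (a stump) as the classifier. The metric values, function names, and the rule of flagging the top two scores are all hypothetical.

```python
import math
import statistics


def fit_gaussian_model(rows):
    """Fit one independent Gaussian per metric (an assumed naive-Bayes-style model)."""
    columns = list(zip(*rows))
    return [(statistics.fmean(c), max(statistics.pstdev(c), 1e-6)) for c in columns]


def neg_log_likelihood(model, row):
    """Negative log-likelihood of one record under the per-metric Gaussians."""
    total = 0.0
    for x, (mu, sigma) in zip(row, model):
        total += 0.5 * ((x - mu) / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))
    return total


def fit_ensemble(rows, n_models=4):
    """Train member k on all records except those whose index is k modulo n_models."""
    return [
        fit_gaussian_model([r for i, r in enumerate(rows) if i % n_models != k])
        for k in range(n_models)
    ]


def anomaly_score(ensemble, row):
    """Average the members' scores; a higher score means less like normal behavior."""
    return sum(neg_log_likelihood(m, row) for m in ensemble) / len(ensemble)


def fit_stump(rows, labels):
    """Depth-1 decision tree: choose the (metric, threshold) split with fewest errors."""
    best = None
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            errors = sum((r[j] > threshold) != y for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, j, threshold)
    _, j, threshold = best
    return lambda row: row[j] > threshold


# Hypothetical unlabeled health data: (CPU utilization, memory utilization).
rows = [(0.30 + 0.01 * (i % 5), 0.40 + 0.01 * (i % 7)) for i in range(40)]
rows += [(0.95, 0.90), (0.97, 0.92)]  # two records that precede a failure

# Stage 1 (unsupervised): rank records by ensemble score and flag the top two.
ensemble = fit_ensemble(rows)
scores = [anomaly_score(ensemble, r) for r in rows]
flagged = sorted(range(len(rows)), key=scores.__getitem__, reverse=True)[:2]

# Stage 2 (supervised): treat the flagged records as administrator-verified labels.
labels = [i in flagged for i in range(len(rows))]
predict = fit_stump(rows, labels)
```

Here `predict` takes a new `(cpu, mem)` record and returns `True` when the stump classifies it as failure-prone. In practice the flagged records would pass through administrator review before being used as labels, and a full decision tree would replace the stump.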
