Failure Prediction for Cloud Datacenter by Hybrid Message Pattern Learning

In the operation and management of large-scale cloud data centers, administrators must handle failures in their infrastructure before they cause service-level violations. Failure prediction techniques have been studied because they allow troubleshooting to start at an early stage and so prevent service-level violations from occurring. By its nature, however, failure prediction involves a certain amount of incorrect detection (false positives). When failure prediction is applied to the operation and management of cloud data centers, incorrect detections can trigger unnecessary workaround tasks and incur additional costs. Existing failure prediction methods that use Bayesian inference to identify message patterns related to a certain failure are difficult to apply to relatively stable systems, because their prediction accuracy deteriorates in environments where failures rarely occur. To solve this problem, we propose a novel method that improves the accuracy of failure prediction by suppressing incorrect detections with a hybrid score, which integrates the probability that a message pattern and a failure occur together with the frequency of the message pattern for that failure. We implemented this method and evaluated its accuracy in a real commercial cloud data center. The evaluation results show that, in the best case, it improved the precision of failure prediction by 31.9% compared with the existing method.
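
The idea of a hybrid score can be illustrated with a minimal sketch. The abstract does not give the actual formula, so everything below is an assumption made for illustration: the names `PatternStats`, `cooccurrence_probability`, `pattern_frequency`, and `hybrid_score` are hypothetical, the co-occurrence probability is estimated as a plain ratio rather than by Bayesian inference, and the two terms are combined by a simple product. The sketch only shows why adding a frequency term can suppress false positives from patterns that were observed very rarely.

```python
from dataclasses import dataclass


@dataclass
class PatternStats:
    """Counts for one message pattern (hypothetical structure)."""
    occurrences: int     # time windows in which the pattern appeared
    before_failure: int  # of those, windows that were followed by the failure


def cooccurrence_probability(stats: PatternStats) -> float:
    """Estimate P(failure | pattern) as a plain ratio (assumed estimator)."""
    if stats.occurrences == 0:
        return 0.0
    return stats.before_failure / stats.occurrences


def pattern_frequency(stats: PatternStats, total_failures: int) -> float:
    """Fraction of past failures that were preceded by the pattern (assumed)."""
    if total_failures == 0:
        return 0.0
    return min(1.0, stats.before_failure / total_failures)


def hybrid_score(stats: PatternStats, total_failures: int) -> float:
    """Combine both terms by a product (illustrative choice, not the paper's)."""
    return cooccurrence_probability(stats) * pattern_frequency(stats, total_failures)


# A pattern seen once, by chance just before a failure, versus a pattern
# that preceded 8 of the 10 recorded failures.
rare = PatternStats(occurrences=1, before_failure=1)
common = PatternStats(occurrences=10, before_failure=8)

rare_score = hybrid_score(rare, total_failures=10)
common_score = hybrid_score(common, total_failures=10)
print(rare_score < common_score)  # the rarely seen pattern is ranked lower
```

In this sketch the rare pattern has a co-occurrence probability of 1.0, so a predictor using that probability alone would raise an alert on it. The frequency term lowers its score below that of the pattern with more supporting evidence, which is the kind of suppression the abstract describes.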
