A Survey on Failure Prediction of Large-Scale Server Clusters

As the size and complexity of cluster systems grow, failure rates rise sharply. To reduce the damage caused by failures, it is desirable to identify potential failures before they occur. In this paper, we survey the state of the art in failure prediction for cluster systems. The characteristics of failures in cluster systems are discussed, and some statistical results are shown. We explore ways of collecting and preprocessing data for failure prediction, and suggest a procedure for preprocessing the records in automatically generated log files. We then analyze the main ideas of five prediction methods: statistics-based thresholds, time series analysis, rule-based classification, Bayesian network models, and semi-Markov process models. In addition, with regard to accuracy and practicality, we present five metrics for evaluating failure prediction techniques and compare the five methods against these metrics.
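To make the simplest of these ideas concrete, the following minimal sketch illustrates a statistics-based threshold applied to event counts taken from a preprocessed log. It is an illustration under stated assumptions, not a procedure from any of the surveyed work: the function name `threshold_alarm`, the parameters `baseline_len` and `k`, the mean-plus-`k`-standard-deviations rule, and the sample counts are all hypothetical.

```python
from statistics import mean, stdev


def threshold_alarm(counts, baseline_len, k=3.0):
    """Flag intervals whose event count exceeds a baseline-derived limit.

    counts       -- per-interval counts of warning events (hypothetical data)
    baseline_len -- number of leading intervals assumed to be failure-free
    k            -- how many standard deviations above the mean raise an alarm
    """
    baseline = counts[:baseline_len]
    # Assumed rule: limit = baseline mean + k * baseline sample std deviation.
    limit = mean(baseline) + k * stdev(baseline)
    # Only intervals after the baseline period are checked.
    return [i for i in range(baseline_len, len(counts)) if counts[i] > limit]


# Hypothetical counts of warning events per monitoring interval.
counts = [2, 3, 2, 4, 3, 2, 3, 12, 3, 15]
alarms = threshold_alarm(counts, baseline_len=6)
print(alarms)  # indices of intervals flagged as possible failure precursors
```

A fixed threshold of this kind is cheap to compute, but it depends heavily on how the baseline period and `k` are chosen, which is one reason the more elaborate model-based methods are of interest.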
