The Failure Prediction of Cluster Systems Based on System Logs

The failure prediction of cluster systems is an effective approach to improve the reliability of the cluster systems, which is becoming a new research hotspot of high performance computing, especially with the growth of cluster systems and applications both in scale and complexity. A classification sequential rule model is proposed to predict cluster system failures. The system logs of BlueGene/L, Red Storm, and Spirit are used as experimental datasets to predict cluster system failures. The results show that sequential rule approach outperforms SVM and HSMM in terms of precision and F-measure in 5hr prediction window, and in 1hr or 12hr prediction window, sequential rules, SVM and HSMM have their own strengths and weaknesses respectively.

[1]  Anand Sivasubramaniam,et al.  BlueGene/L Failure Analysis and Prediction Models , 2006, International Conference on Dependable Systems and Networks (DSN'06).

[2]  Glenn A. Fink,et al.  Predicting Computer System Failures Using Support Vector Machines , 2008, WASL.

[3]  Jianfeng Zhan,et al.  LogMaster: Mining Event Correlations in Logs of Large-Scale Cluster Systems , 2010, 2012 IEEE 31st Symposium on Reliable Distributed Systems.

[4]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[5]  Ziming Zhang,et al.  Ensemble of Bayesian Predictors and Decision Trees for Proactive Failure Management in Cloud Computing Systems , 2012, J. Commun..

[6]  Rajeev Thakur,et al.  A Meta-Learning Failure Predictor for Blue Gene/L Systems , 2007, 2007 International Conference on Parallel Processing (ICPP 2007).

[7]  Xiaoshe Dong,et al.  A Survey on Failure Prediction of Large-Scale Server Clusters , 2007 .

[8]  Alberto Sillitti,et al.  Failure prediction based on log files using Random Indexing and Support Vector Machines , 2013, J. Syst. Softw..

[9]  Marcello Cinque,et al.  Log-Based Failure Analysis of Complex Systems: Methodology and Relevant Applications , 2013 .

[10]  Zhiling Lan,et al.  Dynamic Meta-Learning for Failure Prediction in Large-Scale Systems: A Case Study , 2008, 2008 37th International Conference on Parallel Processing.

[11]  Miroslaw Malek,et al.  A survey of online failure prediction methods , 2010, CSUR.

[12]  Hui Xiong,et al.  Failure Prediction in IBM BlueGene/L Event Logs , 2007, ICDM.

[13]  Domenico Cotroneo Innovative Technologies for Dependable OTS-Based Critical Systems , 2013, Springer Milan.

[14]  Wenjian Wang,et al.  Online prediction model based on support vector machine , 2008, Neurocomputing.

[15]  Anand Sivasubramaniam,et al.  Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[16]  Miroslaw Malek,et al.  Using Hidden Semi-Markov Models for Effective Online Failure Prediction , 2007, 2007 26th IEEE International Symposium on Reliable Distributed Systems (SRDS 2007).

[17]  Vipin Kumar,et al.  Mining needle in a haystack: classifying rare classes via two-phase rule induction , 2001, SIGMOD '01.