Failure Prediction in Computational Grids

Accurate failure prediction in grids is critical for reasoning about QoS guarantees such as job completion time and availability. Statistical methods can be used but they suffer from the fact that they are based on assumptions, such as time-homogeneity, that are often not true. In particular, periodic failures are not modeled well by statistical methods. In this paper, we present an alternative mechanism for failure prediction in which periodic failures are first determined and then filtered from the failure list. The remaining failures are then used in a traditional statistical method. We show that the use of pre-filtering leads to an order of magnitude better predictions

[1]  Walter Willinger,et al.  On the Self-Similar Nature of Ethernet Traffic ( extended version ) , 1995 .

[2]  Walter Willinger,et al.  Self-similarity and heavy tails: structural modeling of network traffic , 1998 .

[3]  Kishor S. Trivedi,et al.  Stochastic Reward Nets for Reliability Prediction , 1996 .

[4]  Richard Wolski,et al.  Predicting the CPU availability of time‐shared Unix systems on the computational grid , 2004, Cluster Computing.

[5]  Eric A. Brewer,et al.  Self-similarity in file systems , 1998, SIGMETRICS '98/PERFORMANCE '98.

[6]  María Engracia Gómez,et al.  Analysis of self-similarity in I/O workload using structural modeling , 1999, MASCOTS '99. Proceedings of the Seventh International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems.

[7]  Marion Kee,et al.  Analysis , 2004, Machine Translation.

[8]  Azer Bestavros,et al.  Self-similarity in World Wide Web traffic: evidence and possible causes , 1996, SIGMETRICS '96.

[9]  Raphael A. Finkel,et al.  A Stable Distributed Scheduling Algorithm , 1981, IEEE International Conference on Distributed Computing Systems.

[10]  Walter Willinger,et al.  Analysis, modeling and generation of self-similar VBR video traffic , 1994, SIGCOMM.

[11]  Marvin Theimer,et al.  Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs , 2000, SIGMETRICS '00.

[12]  Jose C. Principe,et al.  Prediction of Chaotic Time Series with Neural Networks , 1992 .

[13]  Mor Harchol-Balter,et al.  Exploiting process lifetime distributions for dynamic load balancing , 1995, SIGMETRICS.

[14]  R. Eigenmann,et al.  Resource Failure Prediction in Fine-Grained Cycle Sharing Systems , 2005 .

[15]  Anand Sivasubramaniam,et al.  Critical event prediction for proactive management in large-scale computer clusters , 2003, KDD '03.

[16]  M. Malhotra,et al.  Selecting and implementing phase approximations for semi-Markov models , 1993 .

[17]  Joseph L. Hellerstein,et al.  Mining partially periodic event patterns with unknown periods , 2001, Proceedings 17th International Conference on Data Engineering.

[18]  Ravishankar K. Iyer,et al.  Error/failure analysis using event logs from fault tolerant systems , 1991, [1991] Digest of Papers. Fault-Tolerant Computing: The Twenty-First International Symposium.

[19]  Kishor S. Trivedi Probability and Statistics with Reliability, Queuing, and Computer Science Applications , 1984 .

[20]  Daniel P. Siewiorek,et al.  Error log analysis: statistical modeling and heuristic trend analysis , 1990 .

[21]  Yennun Huang,et al.  Software rejuvenation: analysis, module and applications , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[22]  Peter A. Dinda,et al.  An evaluation of linear models for host load prediction , 1999, Proceedings. The Eighth International Symposium on High Performance Distributed Computing (Cat. No.99TH8469).

[23]  Amarnath Mukherjee,et al.  On resource management and QoS guarantees for long range dependent traffic , 1995, Proceedings of INFOCOM'95.

[24]  Rudolf Eigenmann,et al.  Resource Availability Prediction in Fine-Grained Cycle Sharing Systems , 2006, 2006 15th IEEE International Conference on High Performance Distributed Computing.

[25]  Richard Wolski,et al.  Automatic methods for predicting machine availability in desktop Grid and peer-to-peer systems , 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004..