Performance Modeling of Resource Failures in Grid Environments

Grids are susceptible to a range of software and hardware failures, so understanding and modeling grid resource failures is both a challenge and an important influence on grid research. However, for reasons such as commercial confidentiality and security, real historical grid logs are difficult to obtain, which makes an accurate model of resource failures especially useful. In this paper, we analyze grid log data and assess the suitability of three candidate statistical distributions for each data set: Weibull, Zipf's law, and Pareto. We then develop a grid resource failure simulator. Finally, using the different failure patterns generated by the simulator, we evaluate several common scheduling algorithms used in grid systems.
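To make the idea of a failure simulator concrete, here is a minimal sketch of how failure times could be generated from one of the candidate distributions. It is not the simulator developed in the paper. The function name `simulate_failures`, the parameter values, and the choice of independent Weibull inter-failure times per resource are all assumptions made for illustration.

```python
import random


def simulate_failures(n_resources, horizon, shape, scale, seed=0):
    """Return {resource_id: [failure times]} within [0, horizon).

    Assumption: each resource fails independently, and the time between
    consecutive failures follows a Weibull distribution.
    """
    rng = random.Random(seed)  # seeded so that a trace is reproducible
    trace = {}
    for rid in range(n_resources):
        t = 0.0
        times = []
        while True:
            # random.weibullvariate(alpha, beta): alpha = scale, beta = shape
            t += rng.weibullvariate(scale, shape)
            if t >= horizon:
                break
            times.append(t)
        trace[rid] = times
    return trace


# Hypothetical parameters: shape < 1 gives a decreasing hazard rate.
trace = simulate_failures(n_resources=4, horizon=1000.0, shape=0.7, scale=50.0)
for rid, times in trace.items():
    print(rid, len(times))
```

Swapping `rng.weibullvariate` for `rng.paretovariate` would give a Pareto-based failure pattern under the same structure, so a scheduler could be tested against each in turn.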
