Fault Tolerance within a Grid Environment

Fault tolerance is an important property in Grid computing because the dependability of individual Grid resources cannot always be guaranteed. Moreover, because resources are used across organizational boundaries, it becomes increasingly difficult to guarantee that a given resource is not malicious in some way. As part of the e-Demand project at the University of Durham, we are seeking to develop both an improved fault model for Grid computing and a method of providing fault tolerance for Grid applications that protects against both malicious and erroneous services. We have begun by investigating whether the traditional distributed systems fault model can be readily applied to Grid computing, or whether improvements and alterations are needed. From our initial investigation, we conclude that timing, omission and interaction faults may become more prevalent in Grid applications than in traditional distributed systems. Building on this initial fault model, we have begun to develop an approach to fault tolerance based on job replication, in which anomalous results (whether maliciously altered or simply wrong) should be caught at the voting stage. The approach combines replication-based fault tolerance with both dynamic prioritization and dynamic scheduling.
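The replicate-then-vote idea can be illustrated with a minimal sketch. This is not the e-Demand implementation: the names `majority_vote`, `run_replicated`, and the three toy replicas are hypothetical, the replicas run sequentially in one process instead of on separate Grid resources, and the dynamic prioritization and scheduling mentioned in the abstract are not modelled. A replica that raises an exception or returns `None` stands in for an omission or timing fault.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Sequence


def majority_vote(results: Sequence[Optional[Hashable]]) -> Optional[Hashable]:
    """Return the value reported by a strict majority of all replicas, else None.

    None entries represent replicas that gave no answer; they cannot win
    the vote, but they still count toward the total number of replicas.
    """
    answered = [r for r in results if r is not None]
    if not answered:
        return None
    value, count = Counter(answered).most_common(1)[0]
    return value if count > len(results) / 2 else None


def run_replicated(
    job: int, replicas: Sequence[Callable[[int], Optional[int]]]
) -> Optional[int]:
    """Send the same job to every replica and vote on the collected results."""
    results = []
    for replica in replicas:
        try:
            results.append(replica(job))
        except Exception:
            # Treat a crashed replica as one that returned nothing.
            results.append(None)
    return majority_vote(results)


# Hypothetical replicas of a service that squares its input.
def correct(x: int) -> Optional[int]:
    return x * x


def faulty(x: int) -> Optional[int]:
    return x * x + 1  # wrong or maliciously altered result


def silent(x: int) -> Optional[int]:
    return None  # omission fault: no result delivered


accepted = run_replicated(4, [correct, correct, faulty])
rejected = run_replicated(4, [correct, faulty, silent])
```

With two correct replicas out of three, the anomalous result is outvoted and `accepted` holds the agreed value. With only one correct replica, no value reaches a strict majority and `rejected` is `None`, which a caller could treat as a signal to re-run the job on other resources.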
