In grid computing, load balancing with fault tolerance is an important issue. Fault tolerance is an essential property of Grid computing because the dependability of individual Grid resources cannot be guaranteed. In distributed systems, fault tolerance is commonly achieved through checkpoint-recovery and through task replication on alternative resources in case of a system outage. Grid services are also often expected to meet minimum levels of service in order to operate acceptably. We propose a fault tolerant load balancing model to address this issue. We designed and implemented a fault detector and a fault manager within the existing intra-cluster and intra-grid load balancing model, thereby making it a fault tolerant load balancing model. Task execution performance improved because of task migration performed by the fault manager. We compared the performance of our novel fault tolerance technique with that of the checkpoint-recovery technique.

I. INTRODUCTION

Grid computing is fast becoming a promising technology because of the collaboration opportunities it creates: organizations can work together toward common goals through resource sharing. As more and more critical applications shift to the Grid platform, it becomes increasingly important to ensure their high availability and fault tolerance.