Fault Tolerance-Genetic Algorithm for Grid Task Scheduling using Check Point

One motivation for grid computing is to aggregate the power of widely distributed resources and provide non-trivial services to users. To achieve this goal, an efficient fault tolerance system is an essential part of the grid. Rather than covering the whole area of grid fault tolerance, this survey reviews the subject mainly from the perspective of checkpointing and identifies the challenges for fault tolerance. In grid environments, execution failures can occur for various reasons, such as network failure, overloaded resources, or the unavailability of required software components. Fault-tolerant systems should therefore be able to identify and handle failures and support reliable execution in the presence of concurrency and failures. When a large number of user jobs are scheduled for parallel execution on an open-resource grid system, the jobs are subject to system failures or delays caused by infected hardware, software vulnerabilities, and distrusted security policies. In this paper we propose task-level fault tolerance. Task-level techniques mask the effects of task execution failures. The four task-level techniques are retry, alternate resource, checkpoint, and replication. The checkpoint strategy achieves optimal load balance across different grid sites. These task-level fault tolerance techniques can improve grid performance significantly at only a moderate increase in extra resources or scheduling delays in a risky grid computing environment.
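To make the difference between two of the task-level techniques concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a task made of equal work units and a deterministic failure schedule. The function name `run_task` and the parameters `fail_at` and `checkpoint_every` are hypothetical, introduced only for illustration. With plain retry a failed task restarts from scratch, while with checkpointing it resumes from the last saved state.

```python
from typing import Iterable


def run_task(total_units: int, fail_at: Iterable[int], checkpoint_every: int = 0) -> int:
    """Simulate one task and return the work units executed, including redone work.

    Assumptions (illustrative only, not taken from the paper):
    - fail_at lists the executed-unit counts at which a failure strikes;
    - checkpoint_every = 0 means plain retry (restart from the beginning);
    - checkpoint_every = k saves progress after every k completed units.
    """
    failures = set(fail_at)
    executed = 0   # total units attempted, including failed and repeated ones
    progress = 0   # units completed so far in the task
    saved = 0      # progress recorded by the most recent checkpoint

    while progress < total_units:
        executed += 1
        if executed in failures:
            # Failure: unsaved progress is lost, roll back to the checkpoint.
            progress = saved
            continue
        progress += 1
        if checkpoint_every and progress % checkpoint_every == 0:
            saved = progress
    return executed


if __name__ == "__main__":
    retry_cost = run_task(10, fail_at=[8])
    checkpoint_cost = run_task(10, fail_at=[8], checkpoint_every=5)
    print("retry:", retry_cost, "checkpoint:", checkpoint_cost)
```

In this toy setting a single failure late in the task costs far less redone work under checkpointing than under retry. The sketch ignores the overhead of writing a checkpoint, which a real scheduler would have to weigh against the work saved.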
