A reliable checkpoint storage strategy for grid

Computational grids are composed of heterogeneous, autonomously managed resources. In such an environment, any resource can join or leave the grid at any time. This makes the grid infrastructure inherently unreliable, resulting in delays and failures of executing jobs. Fault tolerance therefore becomes a vital aspect of the grid for realizing reliability, availability and quality of service. The most common technique for achieving fault tolerance in high performance computing is rollback recovery, which relies on the availability of checkpoints and the stability of storage media. Checkpoints are therefore replicated on storage media. If replication is not done properly, however, it increases job execution time. Furthermore, dedicating powerful resources solely to checkpoint storage wastes the computation power of these resources and may result in bottlenecks when the load on the network is high. To address these problems, this paper proposes a checkpoint-replication-based fault tolerance strategy named Reliable Checkpoint Storage Strategy (RCSS). In RCSS, checkpoints are replicated on all checkpoint servers in the grid in a distributed manner, which decreases checkpoint replication time and in turn improves overall job execution time. Additionally, if a resource fails during execution of a job, RCSS restarts the job from its last valid checkpoint, taken from any checkpoint server in the grid. Furthermore, to increase grid performance, the CPU cycles of checkpoint servers are also utilized during high load on the network. To evaluate the performance of RCSS, simulations are carried out using GridSim. The simulation results show that RCSS improves intra-cluster checkpoint wave completion time by 12.5 % with a varying number of checkpoint servers and reduces checkpoint wave completion time by 50 % with a varying number of clusters. Additionally, RCSS reduces replication time within a cluster by 39.5 %.
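The two behaviours described above, replicating each checkpoint on every checkpoint server and restarting from the last valid checkpoint held by any server, can be illustrated with a small in-memory sketch. This is not the RCSS algorithm from the paper, whose details the abstract does not give. The class `CheckpointServer`, the functions `replicate` and `last_valid_checkpoint`, the use of sequence numbers, and the SHA-256 validity check are all assumptions made for illustration only.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class CheckpointServer:
    """Hypothetical in-memory stand-in for a checkpoint server."""
    name: str
    alive: bool = True
    # job_id -> list of (sequence number, payload, checksum)
    store: Dict[str, List[Tuple[int, bytes, str]]] = field(default_factory=dict)

    def save(self, job_id: str, seq: int, payload: bytes) -> bool:
        """Store one checkpoint; return False if the server has left the grid."""
        if not self.alive:
            return False
        checksum = hashlib.sha256(payload).hexdigest()
        self.store.setdefault(job_id, []).append((seq, payload, checksum))
        return True


def replicate(servers: List[CheckpointServer], job_id: str,
              seq: int, payload: bytes) -> int:
    """Copy a checkpoint to every reachable server; return the replica count."""
    return sum(s.save(job_id, seq, payload) for s in servers)


def last_valid_checkpoint(servers: List[CheckpointServer],
                          job_id: str) -> Optional[Tuple[int, bytes]]:
    """Return the newest checkpoint whose checksum still matches, from any live server."""
    best: Optional[Tuple[int, bytes]] = None
    for server in servers:
        if not server.alive:
            continue
        for seq, payload, checksum in server.store.get(job_id, []):
            # Assumed validity rule: a checkpoint is valid if its hash matches.
            if hashlib.sha256(payload).hexdigest() != checksum:
                continue
            if best is None or seq > best[0]:
                best = (seq, payload)
    return best


if __name__ == "__main__":
    servers = [CheckpointServer(f"cs{i}") for i in range(3)]
    replicate(servers, "job-1", 1, b"state-1")
    servers[0].alive = False          # one server leaves the grid
    replicate(servers, "job-1", 2, b"state-2")
    servers[1].alive = False          # a second server leaves
    print(last_valid_checkpoint(servers, "job-1"))
```

In this toy model a job can be restarted as long as at least one server holding a valid copy remains reachable, which is the property the abstract attributes to replicating on all checkpoint servers. Network cost, replication scheduling, and the reuse of checkpoint-server CPU cycles are deliberately left out.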
