A New Roll-Forward Checkpointing / Recovery Mechanism for Cluster Federation

Summary: In this paper, we address the complex problem of determining a recovery line for a cluster federation and propose an efficient checkpointing / recovery mechanism for it. The main objective of the proposed approach is to advance the recovery line so that, when failures occur in the cluster federation, the amount of rollback by the processes in all the clusters is bounded; in the worst case, only a limited domino effect is possible. Processes in different clusters carry out their responsibilities independently and simultaneously, and this inherent parallelism contributes to the algorithm's speed of execution. We show that the proposed approach is superior to existing works because it neither suffers from message storms nor takes unnecessary checkpoints.
