Comparing Checkpoint and Rollback Recovery Schemes in a Cluster System

Cluster systems play a central role in realizing high-performance computing at relatively low cost, and fault-tolerance features are necessary for their practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead of a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. Because multi-node failures as well as single-node failures may occur in a cluster system, it is generally not easy to obtain a closed form for the expected total recovery overhead. Based on a simple failure model, we derive it by enumerating all possible combinations of probabilistic events caused by multi-node failures. We further compare the expected total recovery overhead across the three schemes and quantitatively evaluate their effectiveness.
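The enumeration idea can be illustrated with a small sketch. It is not the paper's model: it assumes each node fails independently with probability p, and the cost function `mirroring_cost`, its pairing of nodes (0,1), (2,3), ..., and the values `mirror_cost` and `restart_cost` are hypothetical placeholders chosen only to show how an expectation over all failure combinations is assembled.

```python
from itertools import combinations


def expected_overhead(n, p, cost):
    """Sum cost(failed_set) * P(failed_set) over every subset of n nodes.

    Assumption (not from the paper): nodes fail independently,
    each with probability p.
    """
    total = 0.0
    for k in range(n + 1):
        # Every failure pattern with exactly k failed nodes has the same probability.
        prob = p ** k * (1 - p) ** (n - k)
        for failed in combinations(range(n), k):
            total += prob * cost(frozenset(failed))
    return total


def mirroring_cost(failed, mirror_cost=1.0, restart_cost=10.0):
    """Hypothetical overhead for a mirroring-style scheme.

    Nodes are paired (0,1), (2,3), ...; a node's partner is node ^ 1.
    If both nodes of a pair fail, the mirrored copy is lost and the
    (made-up) restart_cost applies; otherwise mirror_cost applies.
    """
    if not failed:
        return 0.0
    if any((node ^ 1) in failed for node in failed):
        return restart_cost
    return mirror_cost


if __name__ == "__main__":
    print(expected_overhead(4, 0.05, mirroring_cost))
```

Swapping in a different cost function for each scheme would allow a like-for-like comparison under the same failure assumption. The enumeration visits 2^n subsets, so it is practical only for small n.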
