Paper Information - Comparing Checkpoint and Rollback Recovery Schemes in a Cluster System

Comparing Checkpoint and Rollback Recovery Schemes in a Cluster System

Cluster systems play a central role in realizing high-performance computing at relatively low cost, and at the same time they require fault-tolerance features for practical use. In this paper we develop stochastic models to evaluate the expected total recovery overhead of a cluster computing system under three well-known checkpoint and rollback recovery schemes: checkpoint mirroring, central file server checkpointing, and skewed checkpointing, where the fault latency time after a system failure is given by a random variable. Because multi-node failures as well as single-node failures may occur in a cluster system, it is in general not easy to obtain a closed-form expression for the expected total recovery overhead. Based on a simple failure model, we derive it by enumerating all possible combinations of probabilistic events caused by multi-node failures. We further compare the expected total recovery overheads of the different checkpoint and rollback recovery schemes and quantitatively evaluate their effectiveness.

Tadashi Dohi | Noriaki Bessho