Paper Information - Checkpointing in Distributed Computing Systems

Checkpointing in Distributed Computing Systems

This paper examines the performance of synchronous checkpointing in a distributed computing environment with and without load redistribution. Performance models are developed, and optimum checkpoint intervals are determined. The analysis extends earlier work by allowing for multiple nodes, state-dependent checkpoint intervals, and a performance metric which is coupled with failure-free performance and the speedup functions associated with implementation of parallel algorithms. The analytic results for synchronous checkpointing without load redistribution are compared to measurements of a synthetic parallel algorithm with user-level checkpointing. Expressions for the optimum checkpoint intervals for synchronous checkpointing with and without load redistribution are used to determine when load redistribution is advantageous.

Mark A. Franklin | Kenneth F. Wong

[1] Erol Gelenbe, et al. On the Optimum Checkpoint Interval, 1979, JACM.

[2] Mark A. Franklin, et al. Distributed computing systems and checkpointing, 1993, Proceedings of the 2nd International Symposium on High Performance Distributed Computing.

[3] Kishor S. Trivedi. Probability and Statistics with Reliability, Queuing, and Computer Science Applications, 1984.

[4] K. Mani Chandy, et al. Analytic models for rollback and recovery strategies in data base systems, 1975, IEEE Transactions on Software Engineering.

[5] Victor F. Nicola, et al. Comparative Analysis of Different Models of Checkpointing and Recovery, 1990, IEEE Trans. Software Eng.

[6] Willy Zwaenepoel, et al. Manetho: fault tolerance in distributed systems using rollback-recovery and process replication, 1994.

[7] Nitin H. Vaidya. A Case for Multi-Level Distributed Recovery Schemes, 1994.

[8] Vaidy S. Sunderam, et al. PVM: A Framework for Parallel Distributed Computing, 1990, Concurr. Pract. Exp.

[9] Terry Williams, et al. Probability and Statistics with Reliability, Queueing and Computer Science Applications, 1983.