Communication pattern based checkpointing coordination for fault-tolerant distributed computing systems

This paper presents a new checkpointing coordination scheme that exploits the communication pattern of the cooperating processes. In the proposed scheme, checkpointing is coordinated among only a limited number of processes, based on information about the communication pattern of the target program. Unlike previous solutions, which do not use the communication pattern, the scheme can reduce both the coordination effort and the checkpointing frequency. Extensive simulations were performed to evaluate the proposed scheme, and the results indicate that it significantly reduces checkpointing overhead compared with loose coordination schemes.
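As a rough illustration of restricting coordination to a limited set of processes, the sketch below derives the group that would checkpoint together with an initiator from a known communication pattern. It is not the algorithm of the paper, whose details the abstract does not give. The function name `coordination_group`, the representation of the pattern as a mapping from each process to the processes it sends to, and the choice to treat communication as symmetric and transitive are all assumptions made for this example.

```python
from collections import deque


def coordination_group(pattern, initiator):
    """Return the set of processes reachable from `initiator` in the pattern.

    `pattern` maps a process name to the set of processes it sends to.
    Assumption: any message exchange, in either direction, links two
    processes for the purpose of coordination.
    """
    # Build a symmetric adjacency map from the (directed) pattern.
    neighbors = {}
    for sender, receivers in pattern.items():
        neighbors.setdefault(sender, set())
        for receiver in receivers:
            neighbors[sender].add(receiver)
            neighbors.setdefault(receiver, set()).add(sender)

    # Breadth-first search collects every process linked to the initiator.
    group = {initiator}
    queue = deque([initiator])
    while queue:
        current = queue.popleft()
        for peer in neighbors.get(current, ()):
            if peer not in group:
                group.add(peer)
                queue.append(peer)
    return group


# Hypothetical pattern: P0-P1-P2 form one cluster, P3-P4 another.
pattern = {"P0": {"P1"}, "P1": {"P2"}, "P3": {"P4"}}
print(sorted(coordination_group(pattern, "P0")))  # ['P0', 'P1', 'P2']
```

Under these assumptions, a checkpoint initiated by P0 involves only P0, P1, and P2. P3 and P4 are left undisturbed, which is the kind of saving in coordination effort the abstract describes.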
