A non-intrusive minimum process synchronous checkpointing protocol for mobile distributed systems

Mobile computing raises many new issues, such as the lack of stable storage, the low bandwidth of wireless channels, high mobility, and limited battery life. These issues make traditional checkpointing algorithms unsuitable for mobile distributed systems. Minimum process coordinated checkpointing is a good approach to introducing fault tolerance into a distributed system transparently. This approach is domino-free, requires at most two checkpoints of each process on stable storage, and forces only interacting processes to checkpoint. However, it sometimes also requires piggybacking information onto normal messages, blocking the underlying computation, or taking some useless checkpoints. In this paper, we propose a non-intrusive minimum process synchronous checkpointing protocol for mobile distributed systems, in which only the minimum number of tentative checkpoints is taken. We also reduce the number of useless forced (mutable) checkpoints and the message overhead compared to Cao et al. (2001).
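The idea that only interacting processes are forced to checkpoint can be illustrated with a small sketch. This is not the protocol proposed in the paper; it only shows, under simplifying assumptions, how a minimum set of processes might be derived from direct dependencies. The function name `minimum_checkpoint_set` and the `depends_on` map are hypothetical: `depends_on[p]` is assumed to hold the processes from which `p` has received a message since its last checkpoint, and the whole map is assumed to be known in one place, which a real distributed protocol would not have.

```python
from collections import deque


def minimum_checkpoint_set(initiator, depends_on):
    """Return the processes that must checkpoint along with `initiator`.

    depends_on[p] (hypothetical input) is the set of processes from which
    p received at least one message since p's last checkpoint. The result
    is the transitive closure of that relation, starting at the initiator.
    """
    must_checkpoint = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in depends_on.get(p, set()):
            if q not in must_checkpoint:
                must_checkpoint.add(q)
                queue.append(q)
    return must_checkpoint


# Illustrative dependency map: P1 heard from P2, P2 heard from P3,
# and P4 heard from P5. P4 and P5 never interacted with P1's group.
deps = {"P1": {"P2"}, "P2": {"P3"}, "P3": set(), "P4": {"P5"}, "P5": set()}
print(sorted(minimum_checkpoint_set("P1", deps)))  # ['P1', 'P2', 'P3']
```

In this example P4 and P5 are left undisturbed when P1 initiates, which is the saving a minimum process approach aims for on battery- and bandwidth-constrained mobile hosts.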

[1]  Achour Mostéfaoui, et al.  A communication-induced checkpointing protocol that ensures rollback-dependency trackability, 1997, Proceedings of IEEE 27th International Symposium on Fault Tolerant Computing.

[2]  Mukesh Singhal, et al.  On the impossibility of min-process non-blocking checkpointing and an efficient checkpointing algorithm for mobile computing systems, 1998, Proceedings of the 1998 International Conference on Parallel Processing.

[3]  B. R. Badrinath, et al.  Checkpointing distributed applications on mobile computers, 1994, Proceedings of 3rd International Conference on Parallel and Distributed Information Systems.

[4]  Makoto Takizawa, et al.  Checkpoint-recovery protocol for reliable mobile systems, 1998, Proceedings Seventeenth IEEE Symposium on Reliable Distributed Systems.

[5]  Mukesh Singhal, et al.  Mutable Checkpoints: A New Checkpointing Approach for Mobile Computing Systems, 2001, IEEE Trans. Parallel Distributed Syst.

[6]  Willy Zwaenepoel, et al.  The performance of consistent checkpointing, 1992, Proceedings 11th Symposium on Reliable Distributed Systems.

[7]  Mukesh Singhal, et al.  Low-Cost Checkpointing and Failure Recovery in Mobile Computing Systems, 1996, IEEE Trans. Parallel Distributed Syst.

[8]  Leslie Lamport, et al.  Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[9]  Richard Koo, et al.  Checkpointing and Rollback-Recovery for Distributed Systems, 1986, IEEE Transactions on Software Engineering.