Efficient Algorithms for Crash Recovery in Distributed Systems

We consider the problem of recovering efficiently from processor failures in distributed systems. Each message received is logged in volatile storage as it is processed. At irregular intervals, each processor independently saves the contents of its volatile storage to stable storage. By appending only O(1) extra information to each message, we show that O(n²) messages are sufficient for recovery in general networks, and that Θ(n) messages are necessary and sufficient in ring networks, when an arbitrary number of processors fail. By appending O(n) extra information to each message sent, we show that O(kn) messages are sufficient to roll back all of the processors to the maximum consistent states when there are k failures.
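
To make the storage model concrete, the following is a minimal sketch of a single processor that logs received messages in volatile storage and occasionally checkpoints them to stable storage. It illustrates only the logging and checkpointing setup described above, not the recovery algorithms themselves; the class `Processor` and every method and field name are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Hypothetical model of one processor; all names are illustrative."""
    pid: int
    volatile_log: list = field(default_factory=list)  # lost on a crash
    stable_log: list = field(default_factory=list)    # survives a crash

    def receive(self, sender: int, payload: str) -> None:
        # Log the message in volatile storage as it is processed.
        self.volatile_log.append((sender, payload))

    def checkpoint(self) -> None:
        # Save the current contents of volatile storage to stable storage.
        self.stable_log = list(self.volatile_log)

    def crash_and_restart(self) -> int:
        # A failure wipes volatile storage; the processor restarts from
        # whatever was last saved in stable storage.
        lost = len(self.volatile_log) - len(self.stable_log)
        self.volatile_log = list(self.stable_log)
        return lost  # number of logged messages that did not survive


p = Processor(pid=0)
p.receive(1, "a")
p.receive(2, "b")
p.checkpoint()          # "a" and "b" are now in stable storage
p.receive(1, "c")       # logged only in volatile storage
lost = p.crash_and_restart()
print(lost, p.volatile_log)
```

In this sketch the message received after the last checkpoint is lost in the crash. Any processor whose state depends on such a lost message may also have to roll back, which is why recovery requires coordination among processors.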
