Trade-offs in implementing causal message logging protocols

Causal message logging protocols [3] have several attractive properties: they introduce no blocking, send no messages beyond those sent by the application, and never create orphans when processes crash. Causal message logging does, however, require additional data to be piggybacked on application messages, and the amount of piggybacked data can become large. In this paper, we present five different implementations of causal message logging. All of the corresponding protocols are parameterized by f, the maximum number of processes that can fail concurrently. We also explore how the application’s communication structure can be exploited to limit the amount of piggybacked data.
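To make the piggybacking cost concrete, the following is a minimal sketch of one way the idea could work. It is not one of the five implementations presented in the paper. It assumes that each delivery is recorded as a "determinant" and that a determinant no longer needs to be piggybacked once f + 1 processes are known to hold a copy. The names `Determinant`, `Process`, `unstable`, and `run_chain` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Records the delivery of one message (hypothetical layout)."""
    sender: int
    send_seq: int
    receiver: int
    recv_seq: int


@dataclass
class Process:
    pid: int
    f: int  # assumed bound on concurrent failures
    # determinant -> ids of processes known to hold a copy of it
    log: dict = field(default_factory=dict)
    send_seq: int = 0
    recv_seq: int = 0

    def unstable(self):
        """Determinants not yet known to be held by f + 1 processes."""
        return {d: set(h) for d, h in self.log.items() if len(h) <= self.f}

    def send(self, dest, payload):
        """Build an application message carrying the unstable determinants."""
        self.send_seq += 1
        return {"src": self.pid, "seq": self.send_seq, "dest": dest,
                "payload": payload, "piggyback": self.unstable()}

    def deliver(self, msg):
        """Merge piggybacked determinants, then record this delivery."""
        self.recv_seq += 1
        for d, holders in msg["piggyback"].items():
            self.log.setdefault(d, set()).update(holders | {self.pid})
        own = Determinant(msg["src"], msg["seq"], self.pid, self.recv_seq)
        self.log[own] = {self.pid}
        return msg["payload"]


def run_chain(f):
    """Send p0 -> p1 -> p2 -> p0 and return each message's piggyback size."""
    p = [Process(pid=i, f=f) for i in range(3)]
    sizes = []
    for src, dst in [(0, 1), (1, 2), (2, 0)]:
        msg = p[src].send(dst, "data")
        sizes.append(len(msg["piggyback"]))
        p[dst].deliver(msg)
    return sizes
```

Under these assumptions, `run_chain(1)` returns `[0, 1, 1]` and `run_chain(2)` returns `[0, 1, 2]`: with a larger f, each determinant must reach more processes before it can be dropped, so messages carry more piggybacked data. No extra messages are sent and no sender blocks; the only overhead is the piggybacked set.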

[1] Amir Pnueli, et al. The temporal logic of programs, 1977, 18th Annual Symposium on Foundations of Computer Science (sfcs 1977).

[2] Leslie Lamport, et al. Time, clocks, and the ordering of events in a distributed system, 1978, CACM.

[3] David L. Presotto, et al. Publishing: a reliable broadcast communication mechanism, 1983, SOSP '83.

[4] Anita Borg, et al. A message system supporting fault tolerance, 1983, SOSP '83.

[5] Fred B. Schneider, et al. Byzantine generals in action: implementing fail-stop processors, 1984, TOCS.

[6] Robert E. Strom, et al. Optimistic recovery in distributed systems, 1985, TOCS.

[7] Leslie Lamport, et al. Distributed snapshots: determining global states of distributed systems, 1985, TOCS.

[8] David B. Johnson, et al. Sender-Based Message Logging, 1987.

[9] David F. Bacon, et al. Volatile logging in n-fault-tolerant distributed systems, 1988, The Eighteenth International Symposium on Fault-Tolerant Computing, Digest of Papers.

[10] V. Rich. Personal communication, 1989, Nature.

[11] A. Prasad Sistla, et al. Efficient distributed recovery using message logging, 1989, PODC '89.

[12] David B. Johnson, et al. Recovery in Distributed Systems Using Optimistic Message Logging and Checkpointing, 1988, J. Algorithms.

[13] Lorenzo Alvisi, et al. Paralex: an environment for parallel programming in distributed systems, 1991, ICS '92.

[14] Willy Zwaenepoel, et al. Manetho: Transparent Rollback-Recovery with Low Overhead, Limited Rollback, and Fast Output Commit, 1992, IEEE Trans. Computers.

[15] Lorenzo Alvisi, et al. Nonblocking and orphan-free message logging protocols, 1992, FTCS-23 The Twenty-Third International Symposium on Fault-Tolerant Computing.

[16] Brian N. Bershad, et al. The Midway distributed shared memory system, 1993, Digest of Papers, Compcon Spring.

[17] Willy Zwaenepoel, et al. Manetho: fault tolerance in distributed systems using rollback-recovery and process replication, 1994.

[18] Lorenzo Alvisi, et al. Deriving optimal checkpoint protocols for distributed shared memory architectures, 1995, PODC '95.

[19] Lorenzo Alvisi, et al. Message logging: pessimistic, optimistic, and causal, 1995, Proceedings of 15th International Conference on Distributed Computing Systems.

[20] Lorenzo Alvisi. Understanding the message logging paradigm for masking process crashes, 1996.

[21] L. Alvisi, et al. Message Logging: Pessimistic, Optimistic, Causal, and Optimal, 1998, IEEE Trans. Software Eng.