Causality tracking in causal message-logging protocols

Abstract. Causal message-logging protocols have several attractive properties: they introduce no blocking, send no additional messages beyond those sent by the application, and never create orphans. Causal message logging does, however, require the causal effects of message deliveries to be tracked. The causality-tracking information is piggybacked on application messages, and the amount of such information can become large.

In this paper we study the cost of tracking causality in causal message-logging protocols. One can track causality as accurately as possible, but doing so requires piggybacking a considerable amount of additional information. One can reduce the amount of information piggybacked on each message by reducing the accuracy of causality tracking, but causal message logging may then piggyback the reduced amount of information on more messages.

We specify six different methods of tracking causality, each representing a natural choice based on the specification of causal message logging. We describe how these six methods can be implemented and compare them in terms of how large a piggyback load they impose. This load depends on the application that is using causal message logging. We characterize some applications for which a given method has the smallest piggyback load, and use simulation to study the size of the piggyback load for two different models of applications.
