Message fragment based causal message logging

In distributed computing systems, message logging is widely used to provide nodes with recoverability. To reduce the piggyback overhead of traditional causal message logging, this paper presents a zoning causal message logging approach. The crux of the approach is to control the propagation of dependency information: the nodes in the system are divided into zones and, through a message fragment mechanism, the dependency information of a node is visible only within its own zone. Simulation results show that the proposed approach incurs lower piggyback overhead than traditional causal message logging.
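The zone-scoping idea can be illustrated with a minimal sketch. The abstract does not describe how the message fragment mechanism works, so the code below does not attempt to model it; it only assumes that a message crossing a zone boundary carries no dependency information, while a message inside a zone carries everything the sender knows. The names `Determinant`, `Node`, `piggyback`, and `send` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Determinant:
    """Record of one message delivery (hypothetical layout)."""
    sender: int
    receiver: int
    seq: int


@dataclass
class Node:
    node_id: int
    zone: int
    # Determinants this node currently knows about.
    known: set = field(default_factory=set)


def piggyback(src: Node, dst: Node) -> set:
    """Return the determinants src attaches to a message for dst.

    Assumption: dependency information never leaves its zone, so a
    cross-zone message carries an empty piggyback.
    """
    if src.zone != dst.zone:
        return set()
    return set(src.known)


def send(src: Node, dst: Node, seq: int) -> int:
    """Deliver one message and return the number of piggybacked determinants."""
    carried = piggyback(src, dst)
    dst.known |= carried
    # The receiver creates the determinant for this delivery.
    dst.known.add(Determinant(src.node_id, dst.node_id, seq))
    return len(carried)


# Nodes a and b share zone 0; node c is in zone 1.
a, b, c = Node(0, 0), Node(1, 0), Node(2, 1)
overheads = [
    send(a, b, 1),  # a knows nothing yet
    send(b, a, 1),  # intra-zone: b forwards what it knows
    send(a, c, 2),  # cross-zone: nothing is piggybacked
    send(a, b, 3),  # intra-zone: a forwards what it knows
]
```

In this toy run the cross-zone message carries nothing, so node `c` learns only about its own delivery. That is the effect the abstract attributes to zoning, shown here without any claim about the paper's actual protocol or its recovery guarantees.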
