Supporting nondeterministic execution in fault-tolerant systems

We present a technique to track nondeterminism resulting from asynchronous events and multithreading in log-based rollback-recovery protocols. The technique uses a software counter to count the instructions executed between nondeterministic events during normal operation. If a failure occurs, the instruction counts are used to force the replay of these events at the same execution points. The application's execution can thus be replayed to recreate the pre-failure state, while accommodating uncontrolled nondeterminism during normal operation. An implementation on a DEC Alpha processor shows that this support has low overhead, typically an increase of less than 6% in running time for the applications we studied.
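The idea of logging instruction counts and replaying events at the same execution points can be illustrated with a toy model. The sketch below is not the paper's DEC Alpha implementation: the functions `step`, `apply_event`, `run_and_log`, and `replay`, the 5% event probability, and the XOR "handler" are all hypothetical stand-ins chosen for illustration. Each loop iteration plays the role of one instruction, and an asynchronous event is modeled as a state change that arrives at an unpredictable iteration.

```python
import random


def step(state):
    # One deterministic "instruction" (hypothetical): a simple arithmetic update.
    return (state * 31 + 7) % 1_000_003


def apply_event(state, payload):
    # Hypothetical effect of an asynchronous event (e.g. a signal handler).
    return state ^ payload


def run_and_log(initial_state, total_instructions, rng):
    """Normal operation: events arrive at uncontrolled points.

    For each event, log the number of instructions executed since the
    previous event, together with the event's payload.
    """
    state = initial_state
    log = []
    since_last = 0  # software instruction counter, reset at each event
    for _ in range(total_instructions):
        if rng.random() < 0.05:  # an event happens to arrive here
            payload = rng.randrange(1 << 16)
            log.append((since_last, payload))
            state = apply_event(state, payload)
            since_last = 0
        state = step(state)
        since_last += 1
    return state, log


def replay(initial_state, total_instructions, log):
    """Recovery: re-execute from the saved state and deliver each logged
    event once the counter reaches the logged instruction count."""
    state = initial_state
    pending = iter(log)
    next_event = next(pending, None)
    since_last = 0
    for _ in range(total_instructions):
        if next_event is not None and next_event[0] == since_last:
            state = apply_event(state, next_event[1])
            since_last = 0
            next_event = next(pending, None)
        state = step(state)
        since_last += 1
    return state


# Usage: a run with random event arrivals, then a replay driven only by the log.
final_state, event_log = run_and_log(42, 1000, random.Random(1))
replayed_state = replay(42, 1000, event_log)
```

In this model the replay needs no random source at all: the log of (instruction count, payload) pairs is enough to reproduce the pre-failure state, which is the property the technique relies on.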
