TFT: a software system for application-transparent fault tolerance

An important objective of software fault-tolerant systems should be to provide a fault-tolerance infrastructure that minimizes the effort required of the application developer. In the limit, the objective is to provide fault tolerance transparently to the application. TFT, the work presented in this paper, provides transparent fault tolerance at a higher interface than prior solutions. TFT coordinates replicas at the system call interface, interposing a supervisor agent between the application and the operating system. Moving replica coordination to this interface allows uncorrelated faults within the operating system and below to be tolerated, and also admits the possibility of online operating system and hardware upgrades. To accomplish its task, TFT must enforce a deterministic computation above the system call interface. The potential sources of non-determinism addressed include non-deterministic system calls, delivery of asynchronous events, and the representation of operating system abstractions that differ between replicas.
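
As a rough illustration of coordinating replicas at the system call interface, the following minimal Python sketch shows one way a supervisor could hide a non-deterministic system call from the application. It is not TFT's implementation. The names `Supervisor`, `syscall`, and `application` are hypothetical, an in-memory queue stands in for the link between replicas, and the sketch assumes a primary/backup arrangement in which the primary's results are forwarded to the backup. It covers only non-deterministic call results, not asynchronous events.

```python
import os
import time
from collections import deque


class Supervisor:
    """Hypothetical supervisor interposed between an application and the OS."""

    def __init__(self, role, channel):
        self.role = role        # "primary" or "backup" (assumed roles)
        self.channel = channel  # deque standing in for the replica link

    def syscall(self, name, real_call):
        if self.role == "primary":
            # The primary performs the real call and forwards its result.
            result = real_call()
            self.channel.append((name, result))
            return result
        # The backup does not call the OS; it reuses the primary's result
        # so both replicas observe the same value.
        logged_name, result = self.channel.popleft()
        if logged_name != name:
            raise RuntimeError(f"replica divergence: {logged_name} != {name}")
        return result


def application(sup):
    """Unmodified application logic; it only sees supervisor-mediated calls."""
    now = sup.syscall("time", time.time)    # non-deterministic result
    pid = sup.syscall("getpid", os.getpid)  # OS abstraction that differs per replica
    return (now, pid)


channel = deque()
primary_view = application(Supervisor("primary", channel))
backup_view = application(Supervisor("backup", channel))
print(primary_view == backup_view)
```

In this sketch the application code is identical on both replicas, which is the sense in which the coordination is transparent: only the supervisor knows which replica it is running on.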
