Lessons from FTM: An Experiment in Design and Implementation of a Low-Cost Fault-Tolerant System

This paper describes an experiment in the design of a general-purpose fault-tolerant system, FTM. The main objective of the FTM design was to implement a low-cost fault-tolerant system that could be used on standard workstations. At the operating system level, the authors' goal was to make fault tolerance transparent to user applications: porting an application to FTM should require only compiling its source code, without modifying it. These objectives were achieved using the Mach micro-kernel and a modular set of reliable servers that implement application checkpoints and provide continuous system functions despite machine crashes. At the architectural level, their approach relies on a high-performance stable storage implementation, called stable transactional memory (STM), which can be implemented in either hardware or software. The authors first motivate their design choices, then detail the FTM implementation at both the architectural and operating system levels. They discuss why their stable memory technology evolved from hardware to software, evaluate the performance of the FTM prototype, and conclude with lessons learned and some assessments.
