Fault Tolerance: Why Should I Pay for It?

Fault-tolerant systems are not as widely used today as an analysis of the costs of failures would suggest. System developers must weigh other factors as well: where should development dollars be spent for maximum leverage? Will development in one area (e.g., fault tolerance) impede development in others? Developing fault-tolerance techniques that are orthogonal to other development efforts must be a high priority. Market forces are driving a number of new technologies into products; our analysis suggests that these new technologies will change the trade-offs in both performance cost and development cost.
