Fault Tolerance and System Structuring

We discuss a general approach to the design of fault-tolerant computing systems, concentrating on issues of system structuring rather than on the design of particular algorithms. Three forms of structuring are described. The first is based on the use of what we term “idealized fault-tolerant components”. Such components provide a means of system structuring which makes it easy to identify which parts of a system are responsible for coping with which sorts of faults. The second is a “recursive structuring” scheme. It involves using complete computers as the basic idealized fault-tolerant components of a distributed computing system whose functionality matches that of its component computers. Finally, we discuss a generalization of the usual concept of an “atomic action”, which provides a means of structuring both forward and backward error recovery in distributed systems. These discussions are given in general terms and are illustrated by brief accounts of recent and current work at Newcastle on the construction of UNIX-based fault-tolerant and distributed systems.
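
The idea of an idealized fault-tolerant component can be illustrated with a minimal sketch. It assumes a reading of the term that the abstract itself does not spell out: a component either returns a normal response, signals an interface exception when the request is invalid, or signals a failure exception when its own internal recovery is exhausted. All names here (IdealizedComponent, InterfaceException, FailureException, request) are hypothetical and chosen only for illustration.

```python
class InterfaceException(Exception):
    """Assumed meaning: the request itself was invalid, so the caller must cope."""


class FailureException(Exception):
    """Assumed meaning: the component could not deliver its service despite internal recovery."""


class IdealizedComponent:
    """Hypothetical component that masks internal faults where it can.

    It tries a primary service routine, then an alternate one. Errors raised
    inside either routine are treated as local exceptions and handled here;
    only the two exception classes above ever cross the interface.
    """

    def __init__(self, primary, alternate):
        self.primary = primary
        self.alternate = alternate

    def request(self, x):
        # Reject invalid requests: responsibility lies with the caller.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            raise InterfaceException(f"expected a number, got {type(x).__name__}")
        # Try each routine in turn; a local exception triggers the next one.
        for service in (self.primary, self.alternate):
            try:
                return service(x)
            except Exception:
                continue
        # Internal recovery is exhausted: tell the enclosing system.
        raise FailureException("all internal alternatives failed")


def faulty_double(x):
    """Illustrative primary routine that always fails."""
    raise RuntimeError("simulated internal fault")


def safe_double(x):
    """Illustrative alternate routine."""
    return x * 2


component = IdealizedComponent(faulty_double, safe_double)
result = component.request(3)  # the internal fault is masked by the alternate
```

In this sketch the caller only ever sees a normal result or one of two exception classes, which is what makes the division of responsibility explicit: an InterfaceException points at the caller, a FailureException at the component.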
