Fail-stop processors: an approach to designing fault-tolerant computing systems

A methodology that facilitates the design of fault-tolerant computing systems is presented. It is based on the notion of a fail-stop processor. Such a processor automatically halts in response to any internal failure and does so before the effects of that failure become visible. The problem of implementing processors that, with high probability, behave like fail-stop processors is addressed. Axiomatic program verification techniques are described for use in developing provably correct programs for fail-stop processors. The design of a process control system illustrates the use of our methodology.
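The fail-stop behaviour described above can be pictured with a small sketch. It is an illustration only and is not taken from the paper. It assumes that failures are detected by running each step on duplicate replicas and comparing the results, and that a result becomes visible only once it is committed to a `stable` dictionary. The names `FailStopProcessor`, `Halted`, `step`, `good`, and `faulty` are hypothetical.

```python
class Halted(Exception):
    """Raised when the processor has stopped and refuses further work."""


class FailStopProcessor:
    """Toy model: run every step on all replicas and halt on any disagreement.

    Assumption (not from the paper): a failure shows up as a mismatch
    between replica results, and 'visible' state is the `stable` dict.
    """

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.stable = {}      # state visible to the outside world
        self.halted = False

    def step(self, key, arg):
        if self.halted:
            raise Halted("processor has halted")
        results = [replica(arg) for replica in self.replicas]
        # Compare before committing, so a bad result never becomes visible.
        if any(r != results[0] for r in results[1:]):
            self.halted = True
            raise Halted("replica mismatch detected")
        self.stable[key] = results[0]
        return results[0]


def good(x):
    return x * 2


def faulty(x):
    # Hypothetical fault: wrong answer for one particular input.
    return x * 2 + (1 if x == 3 else 0)


proc = FailStopProcessor([good, faulty])
proc.step("a", 1)          # replicas agree, so the result is committed
try:
    proc.step("b", 3)      # replicas disagree, so the processor halts
except Halted:
    pass
```

After the mismatch, `proc.stable` holds only the entry written before the failure, and every later call to `step` raises `Halted`. This matches the two properties in the abstract: the processor stops on failure, and it does so before the failure's effects become visible.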
