Reliable Computing Systems

The paper presents an analysis of the various problems involved in achieving very high reliability from complex computing systems, and discusses the relationship between system structuring techniques and techniques of fault tolerance. Topics covered include (i) differing types of reliability requirements, (ii) forms of protective redundancy in hardware and software systems, (iii) methods of structuring the activity of a system, using atomic actions, so as to limit information flow, (iv) error detection techniques, (v) strategies for locating and dealing with faults, and for assessing the damage they have caused, and (vi) forward and backward error recovery techniques, based on the concepts of recovery line, commitment, exception and compensation. A set of appendices provides summary descriptions and analyses of a number of computing systems that have been specifically designed with the aim of achieving very high reliability.
