Study of fault-tolerant software technology

This paper presents an overview of the current state of the art in fault-tolerant software and analyzes the quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the implications that using fault-tolerant software in real-time aerospace applications has for computer architecture and for the design of hardware, operating systems, and programming languages (including Ada). It concludes that fault-tolerant software has progressed beyond the pure research stage and that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to implement software fault tolerance effectively and efficiently.
