Definition and analysis of hardware- and software-fault-tolerant architectures

A structured definition of hardware- and software-fault-tolerant architectures is presented. Software-fault-tolerance methods are discussed, resulting in definitions for soft and solid faults. A soft software fault has a negligible likelihood or recurrence and is recoverable, whereas a solid software fault is recurrent under normal operations or cannot be recovered. A set of hardware- and software-fault-tolerant architectures is presented, and three of them are analyzed and evaluated. Architectures tolerating a single fault and architectures tolerating two consecutive faults are discussed separately. A sidebar addresses the cost issues related to software fault tolerance. The approach taken throughout is as general as possible, dealing with specific classes of faults or techniques only when necessary.<<ETX>>

[1]  S. Yau,et al.  Design of self-checking software , 1975, Reliable Software.

[2]  Atul Prakash,et al.  Software Engineering: Problems and Perspectives , 1984, Computer.

[3]  Algirdas Avizienis,et al.  The N-Version Approach to Fault-Tolerant Software , 1985, IEEE Transactions on Software Engineering.

[4]  Brian Randell Design Fault Tolerance , 1986 .

[5]  Jim Gray,et al.  Why Do Computers Stop and What Can Be Done About It? , 1986, Symposium on Reliability in Distributed Software and Database Systems.

[6]  Jaynarayan H. Lala,et al.  Hardware and software fault tolerance: a unified architectural approach , 1988, [1988] The Eighteenth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[7]  K. H. Kim,et al.  Distributed Execution of Recovery Blocks: An Approach for Uniform Treatment of Hardware and Software Faults in Real-Time Applications , 1989, IEEE Trans. Computers.

[8]  Jean Arlat,et al.  Hardware- and Software-Fault Tolerance , 1990 .