Fault-Tolerant Computing: An Introduction and a Perspective

Fault-tolerant computing has been defined as "the ability to execute specified algorithms correctly regardless of hardware failures, total system flaws, or program fallacies" [1]. To the extent that a system falls short of meeting the requirements of this definition, it can be labeled a partially fault-tolerant system [2]. The definition thus provides a standard against which to measure all systems having a degree of fault tolerance. In particular, one can classify systems according to 1) the amount of manual intervention required in performing, and 2) the class of faults covered by, the three basic functions involved in fault tolerance: system validation, fault diagnosis, and fault masking or recovery. The word "fault" is used here to cover the "failures, flaws, and fallacies" of the original definition. The first function, system validation, is involved in the design and production of the system hardware and software, while the last two functions, fault diagnosis and fault masking or recovery, are embodied in the system itself. Likewise, the first function is directed at faults arising from design and production errors, whereas the last two are aimed at faults due to random hardware failures.

[1] George B. Leeman. Some Problems in Certifying Microprograms, 1975, IEEE Transactions on Computers.

[2] William E. Howden, et al. Methodology for the Generation of Program Test Data, 1975, IEEE Transactions on Computers.

[3] Gabriele Saucier, et al. Diversified Test Methods for Local Control Units, 1975, IEEE Transactions on Computers.

[4] Jacob A. Abraham. A Combinatorial Solution to the Reliability of Interwoven Redundant Logic Networks, 1975, IEEE Transactions on Computers.

[5] Francisco J. O. Dias. Fault Masking in Combinational Logic Circuits, 1975, IEEE Transactions on Computers.

[6] Douglas C. Bossen, et al. Orthogonal Latin Square Configuration for LSI Memory Yield and Reliability Enhancement, 1975, IEEE Transactions on Computers.

[7] John F. Meyer, et al. On-Line Diagnosis of Unrestricted Faults, 1975, IEEE Transactions on Computers.

[8] Daniel P. Siewiorek. Reliability Modeling of Compensating Module Failures in Majority Voted Redundancy, 1975, IEEE Transactions on Computers.

[9] William C. Carter. Fault-Tolerant Computing: An Introduction and a Viewpoint, 1973, IEEE Transactions on Computers.

[10] T. Basil Smith, et al. The Architectural Elements of a Symmetric Fault-Tolerant Multiprocessor, 1975, IEEE Transactions on Computers.

[11] Edward J. McCluskey, et al. Analysis of Logic Circuits with Faults Using Input Signal Probabilities, 1975, IEEE Transactions on Computers.

[12] John E. Bauer, et al. An Advanced Fault Isolation System for Digital Logic, 1975, IEEE Transactions on Computers.

[13] Dwight H. Sawin. Design of Reliable Synchronous Sequential Circuits, 1975, IEEE Transactions on Computers.

[14] John F. Wakerly, et al. Transient Failures in Triple Modular Redundancy Systems with Sequential Modules, 1975, IEEE Transactions on Computers.

[15] T. Basil Smith. A Damage- and Fault-Tolerant Input/Output Network, 1975, IEEE Transactions on Computers.

[16] Barry R. Borgerson, et al. A Reliability Model for Gracefully Degrading and Standby-Sparing Systems, 1975, IEEE Transactions on Computers.

[17] H. Y. Chang, et al. Methods of Interpreting Diagnostic Data for Locating Faults in Digital Machines, 1967.

[18] H. Y. Chang, et al. LAMP: Controllability, Observability, and Maintenance Engineering Technique (COMET), 1974.

[19] Alan M. Usas. A Totally Self-Checking Checker Design for the Detection of Errors in Periodic Signals, 1975, IEEE Transactions on Computers.

[20] Samir Kamal, et al. An Approach to the Diagnosis of Intermittent Faults, 1975, IEEE Transactions on Computers.

[21] Roy C. Ogus, et al. The Probability of a Correct Output from a Combinational Circuit, 1975, IEEE Transactions on Computers.