A fault tolerance infrastructure for dependable computing with high-performance COTS components

The failure rates of current COTS processors have dropped to 100 FITs (failures per 10^9 hours), indicating a potential MTTF of over 1100 years. However, our recent study of Intel P6 family processors has shown that they have very limited error detection and recovery capabilities and contain numerous design faults ("errata"). Other limitations are susceptibility to transient faults and uncertainty about "wearout" that could increase the failure rate over time. Because of these limitations, an external fault tolerance infrastructure is needed to assure the dependability of a system built with such COTS components. The paper describes a fault-tolerant "infrastructure" system of fault tolerance functions that makes possible the use of low-coverage COTS processors in a fault-tolerant, self-repairing system. The custom hardware supports transient recovery, design fault tolerance, and self-repair by sparing and replacement. Fault tolerance functions are implemented by four types of hardware processors of low complexity that are themselves fault-tolerant. High error detection coverage, including coverage of design faults, is attained by diversity and replication.
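The MTTF figure quoted above follows from the FIT rate by simple arithmetic. The sketch below is illustrative only: it assumes a constant failure rate (so that MTTF is the reciprocal of the rate) and a year of 8760 hours, and the function name `mttf_years` is hypothetical, not taken from the paper.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: 8760-hour year, leap days ignored


def mttf_years(fits: float) -> float:
    """Convert a FIT rate (failures per 10^9 device-hours) to MTTF in years.

    Assumes a constant failure rate, so MTTF = 1 / rate.
    """
    if fits <= 0:
        raise ValueError("FIT rate must be positive")
    mttf_hours = 1e9 / fits  # 100 FITs gives 10^7 hours
    return mttf_hours / HOURS_PER_YEAR


if __name__ == "__main__":
    # 100 FITs corresponds to roughly 1140 years, i.e. "over 1100 years".
    print(f"{mttf_years(100):.0f} years")
```

The constant-rate assumption is exactly what the wearout concern in the abstract calls into question: if the failure rate rises with age, this calculation overstates the MTTF.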
