A study of time redundant fault tolerance techniques for superscalar processors

As processor chips incorporate ever more transistors, their circuits become increasingly error-prone, which makes fault tolerance techniques necessary. This paper investigates ways to incorporate fault tolerance into superscalar processors by exploiting the functional unit redundancy that these processors already provide. The schemes investigated require no modifications to the machine's instruction set architecture, and the compiler adds no extra instructions. The paper also presents the results of a simulation study that we conducted to analyze the performance impact of these schemes.
