This paper presents new schemes in which the spare nodes of a permanent-fault-tolerant processing array are used, while idle, to aid an on-line transient-error recovery process. Although spares-based methods are well-known solutions for permanent-fault tolerance, their cost and the idle spare capacity during normal operation have limited their widespread use. Manufacturers must be offered fault-tolerance solutions that perform useful work at all times. We propose to enhance the utility of spares-based methods by commissioning idle spares (those remaining after fabrication and subsequent replacement of faulty units) to perform transient-error recovery tasks. In our scheme, idle spares perform periodic on-line testing (verifying whether the system is functioning correctly) and recovery-point validation during normal operation. When an error occurs, the spare performs additional testing to select recovery points. Transient-error recovery is required in harsh environments, such as those with high radiation, where frequent transient errors are unavoidable. In these environments, the cost of job completion can be extremely high without some form of error recovery. Even in environments subject to error bursts, jobs can complete successfully if reliable data are identified through periodic on-line testing. We apply our scheme to a mesh array architecture that has applications in digital signal processing. Simulations highlight the overhead of our schemes, in terms of job completion time, in environments burdened with frequent random transient errors and burst errors. The proposed recovery strategies are limited to systems of regular structure. Many applications in signal and image processing require array processing in which the various nodes perform similar operations on different data sets.
Therefore, it is not necessary to switch application algorithms on the spares when they perform redundant computation in a staggered mode. While this is a significant advantage, there is a small cost associated with presenting the same data to both a node and a spare. With the built-in hardware and reconfiguration switches of fault-tolerant arrays, we believe this cost will be insignificant. Extending our work to more general systems requires consideration of many issues, including system timing and sub-unit communication and dependence. This is a problem for future research.
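The idea of recovery-point validation by an idle spare can be illustrated with a minimal sketch. It is not the scheme evaluated in the paper. It assumes a single node paired with one error-free spare, a hypothetical per-step function `step`, deterministic fault injection through `faulty_ticks`, and validation by having the spare recompute each segment from the last validated recovery point and compare results. All names and parameters are assumptions made for illustration.

```python
from dataclasses import dataclass


def step(state: int, i: int) -> int:
    """Hypothetical per-step computation, identical on the node and the spare."""
    return (state * 31 + i) % 1_000_003


@dataclass
class RunResult:
    final_state: int
    ticks: int      # node steps executed, including re-executed ones
    rollbacks: int  # number of segments discarded after a failed validation


def run_job(n_steps: int, period: int, faulty_ticks=()) -> RunResult:
    """Run n_steps on a node; the spare validates a recovery point every `period` steps.

    Assumption: the spare itself is error-free, and a transient error flips
    one bit of the node's state at each tick listed in `faulty_ticks`.
    """
    faulty = set(faulty_ticks)
    recovery_step, recovery_state = 0, 0  # last validated recovery point
    ticks = rollbacks = 0
    while recovery_step < n_steps:
        end = min(recovery_step + period, n_steps)
        # Node executes the segment, possibly hit by transient errors.
        state = recovery_state
        for i in range(recovery_step, end):
            state = step(state, i)
            if ticks in faulty:
                state ^= 1  # inject a single-bit transient error
            ticks += 1
        # Spare redundantly recomputes the same segment for comparison.
        check = recovery_state
        for i in range(recovery_step, end):
            check = step(check, i)
        if state == check:
            # Results agree: promote the segment end to a validated recovery point.
            recovery_step, recovery_state = end, state
        else:
            # Results differ: discard the segment and retry from the recovery point.
            rollbacks += 1
    return RunResult(recovery_state, ticks, rollbacks)


if __name__ == "__main__":
    print(run_job(20, 5))                      # fault-free run
    print(run_job(20, 5, faulty_ticks={7}))    # one transient error
    print(run_job(20, 5, faulty_ticks={7, 8, 11}))  # short error burst
```

In this toy model the overhead of recovery shows up as extra ticks: each failed validation costs one re-executed segment, so a shorter validation period reduces the work lost per error at the price of more frequent comparisons.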