Recovery schemes for mesh arrays utilizing dedicated spares

In this paper, new schemes are presented in which the spare nodes of a permanent-fault-tolerant processing array are utilized in their idling state to aid in an on-line transient-error recovery process. Although spares-based methods are a well-known approach to permanent-fault tolerance, the cost of these solutions and the idle spare capacity during normal operation have limited their widespread use. Manufacturers must be offered fault-tolerance solutions that perform useful work at all times. We propose to enhance the utility of spares-based methods by commissioning idling spares (those spares remaining after fabrication and subsequent replacement of faulty units) to perform transient-error recovery tasks. During normal operation, our scheme commissions idling spares to perform periodic on-line testing (verifying whether the system is functioning correctly) and recovery-point validation. When an error occurs, the spare performs additional testing to select recovery points. Transient-error recovery is required in harsh environments, such as those with high radiation, where frequent transient errors are unavoidable. In these environments, the cost of job completion can be extremely high without some form of error recovery. In environments frequented by error bursts, successful job completion can be attained by identifying reliable data through periodic on-line testing. We apply our scheme to a mesh array architecture that has applications in digital signal processing. Simulations highlight the overhead of our schemes in terms of job completion time in environments burdened with frequent random transient errors and burst errors. The proposed recovery strategies are limited to systems of regular structure. Many applications in signal and image processing require array processing in which the various nodes perform similar operations on different data sets.
Therefore, it is not necessary to switch the application algorithms for the spares when they perform redundant computation in a staggered mode. While this is a significant feature, there is a small cost associated with presenting the same data to both a node and a spare. With built-in hardware and reconfiguration switches in the fault-tolerant arrays, we believe this cost will be insignificant. Extension of our work to more general systems requires consideration of many issues, including system timing and sub-unit communication and dependence. This remains a problem for future research.
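
The idea of a spare validating recovery points can be illustrated with a minimal sketch. It is not the scheme evaluated in the paper; it is a toy model under stated assumptions. The names `run_with_spare`, `accumulate`, and `faulty_attempts` are hypothetical, transient errors are injected deterministically as a single bit flip, the spare is assumed to be error-free, and the spare's redundant computation is modeled as a sequential recomputation rather than true staggered hardware execution.

```python
from typing import Callable, Set, Tuple


def accumulate(state: int, step: int) -> int:
    """Hypothetical per-node operation: add (step + 1) to the running state."""
    return state + step + 1


def run_with_spare(
    step_fn: Callable[[int, int], int],
    initial_state: int,
    n_steps: int,
    faulty_attempts: Set[int],
) -> Tuple[int, int]:
    """Run n_steps on a primary node while a spare validates each result.

    faulty_attempts holds the attempt indices (0-based, counting retries)
    at which the primary's result is corrupted; this stands in for
    transient errors. Returns (final_state, total_attempts).
    """
    validated = initial_state  # last recovery point confirmed by the spare
    step = 0
    attempts = 0
    while step < n_steps:
        primary = step_fn(validated, step)
        if attempts in faulty_attempts:
            primary ^= 1  # assumed fault model: flip the lowest bit
        attempts += 1

        # Spare repeats the same operation on the same input data.
        # Assumption: the spare itself is not hit by an error.
        spare = step_fn(validated, step)

        if primary == spare:
            # Results agree: promote this state to the new recovery point.
            validated = primary
            step += 1
        # Otherwise roll back: retry the same step from the validated state.
    return validated, attempts
```

With `accumulate` and five steps, an error-free run returns `(15, 5)`, while injecting errors at attempts 1 and 3 returns `(15, 7)`: the job still completes correctly, and the two extra attempts represent the recovery overhead in job completion time.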