Exploitation of instruction-level parallelism for detection of processor execution errors

As device geometries decrease and processor clock frequencies increase, the incidence of hardware transient errors increases. Simultaneously, computer architectures are using increased degrees of instruction-level resource parallelism to achieve performance goals, e.g. in pipelined, superscalar and Very Long Instruction Word (VLIW) processors. Full utilization of this parallelism is difficult to achieve and sustain, so resources are often left idle. This thesis explores the use of such idle resources for concurrent error detection in processors employing instruction-level resource parallelism. The focus is on the detection of errors in program control flow and program data. An experimental approach is taken in which a commercial VLIW processor, the Multiflow TRACE 14/300, is selected as the target processor. The resource utilization of the TRACE 14/300 during execution of 11 scientific benchmark programs is examined. Experimental evaluation reveals that resource utilization is low. Fundamental factors limiting the resource utilization are identified. These factors indicate that significant idle resources are likely to exist across a wide range of applications for the TRACE 14/300 as well as for other processors employing a significant amount of instruction-level parallelism. A methodology called Available Resource-driven Control-flow monitoring (ARC) is developed to utilize idle processor resources for detecting transient control-flow errors. It is unique in that the monitoring computation's resource use is tailored to the existence of idle resources in the application processor. An algorithm for the implementation of the ARC-based monitoring computation and results characterizing its error detection properties are presented. The results demonstrate that ARC is highly effective in using the idle resources of a processor to achieve concurrent error detection at a very low cost in performance overhead.
Finally, a technique for detecting errors in program data, called Algorithm-Based Fault Tolerance (ABFT), is applied to the TRACE 14/300. ABFT is found to detect a high percentage of data errors, although the degree to which it is able to make use of idle resources varies considerably depending upon the application. Overall results demonstrate that concurrent error detection techniques can significantly reduce their hardware and performance overhead by using idle resources in processors employing instruction-level resource parallelism, while achieving effective error coverage.
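
To illustrate the general idea behind ABFT, the following is a minimal sketch of the well-known checksum scheme for matrix multiplication. It is an assumption made for illustration only: the abstract does not state which ABFT encodings or applications were evaluated, and the function names (`abft_matmul`, `find_errors`, and the helpers) are hypothetical. The sketch uses integer arithmetic so that checksums compare exactly; floating-point data would need a tolerance instead.

```python
# Illustrative sketch only: classic checksum-based ABFT for matrix multiply.
# All names here are hypothetical and not taken from the thesis.

def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def with_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def with_row_checksum(b):
    """Append to each row of b one extra entry holding that row's sum."""
    return [row + [sum(row)] for row in b]

def abft_matmul(a, b):
    """Multiply the augmented operands; the result carries a checksum
    column (last column) and a checksum row (last row)."""
    return matmul(with_column_checksum(a), with_row_checksum(b))

def find_errors(c_full):
    """Return (bad_rows, bad_cols): indices of data rows and columns whose
    recomputed sum disagrees with the stored checksum."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if sum(c_full[i][:m]) != c_full[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c_full[i][j] for i in range(n)) != c_full[n][j]]
    return bad_rows, bad_cols

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c_full = abft_matmul(a, b)
clean_report = find_errors(c_full)      # no mismatches expected

c_full[1][0] += 1                       # simulate a transient data error
faulty_report = find_errors(c_full)     # intersecting row/column locate it
```

The checksum row and column are extra operations that do not depend on the order of the main computation, which suggests why a scheme of this kind could be scheduled onto otherwise idle functional units; how much of that work actually fits into idle slots depends on the application, as the abstract notes.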