Boosting the Performance of Software-Based Transient Errors Tolerant Techniques through Compiler Optimizations

This paper studies how to improve the performance of EDDI (Error Detection by Duplicated Instructions), a state-of-the-art software-based fault-tolerance technique. We evaluate and analyze the performance of EDDI and find that some compiler optimizations that are effective for single-threaded programs with limited ILP become less effective for duplicated code. We then propose compiler-directed register de-replication to reduce the register pressure of EDDI. In addition, we evaluate aggressive use of delayed branches to exploit control-independent instructions across both the original and the duplicated threads, further improving the performance of EDDI. Our experimental results indicate that these purely software optimizations reduce the performance overhead of EDDI by up to 21.5%, with an average of 8.9%.
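To make the idea behind EDDI concrete, the following is a minimal sketch of instruction duplication with a comparison before the result is committed. It is an illustration only, not the paper's compiler pass: the function name `checked_dot`, the `inject_fault_at` parameter, and the exception type are hypothetical, and Python variables stand in for the master and shadow registers. It also shows where the register pressure mentioned above comes from, since every live value has a shadow copy.

```python
class TransientFaultDetected(Exception):
    """Raised when the master and shadow computations disagree (hypothetical name)."""


def checked_dot(xs, ys, inject_fault_at=None):
    """Dot product computed twice, in the style of duplicated instructions.

    inject_fault_at is an illustrative knob: at that loop index a single
    bit of the master product is flipped to simulate a transient error.
    """
    acc = 0         # master "register"
    acc_shadow = 0  # shadow "register" holding the duplicated computation
    for i, (x, y) in enumerate(zip(xs, ys)):
        prod = x * y         # original instruction
        prod_shadow = x * y  # duplicated instruction
        if i == inject_fault_at:
            prod ^= 1        # simulated single-bit flip in the master copy
        acc += prod
        acc_shadow += prod_shadow
    # Compare the two copies before the value leaves the protected region.
    if acc != acc_shadow:
        raise TransientFaultDetected("master and shadow results differ")
    return acc
```

Calling `checked_dot([1, 2, 3], [4, 5, 6])` returns the ordinary dot product, while passing `inject_fault_at=1` makes the two copies diverge and raises the exception. In this sketch the loop holds twice as many live values as an unprotected version would.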
