Low Overhead Soft Error Mitigation Techniques for High-Performance and Aggressive Designs

The threat of soft error induced system failure in computing systems has become more prominent, as we adopt ultradeep submicron process technologies. In this paper, we propose two efficient soft error mitigation schemes, namely, Soft Error Mitigation (SEM) and Soft and Timing Error Mitigation (STEM), using the approach of multiple clocking of data for protecting combinational logic blocks from soft errors. Our first technique, SEM, based on distributed and temporal voting of three registers, unloads the soft error detection overhead from the critical path of the systems. SEM is also capable of ignoring false errors and recovers from soft errors using in-situ fast recovery avoiding recomputation. Our second technique, STEM, while tolerating soft errors, adds timing error detection capability to guarantee reliable execution in aggressively clocked designs that enhance system performance by operating beyond worst-case clock frequency. We also present a specialized low overhead clock phase management scheme that ably supports our proposed techniques. Timing-annotated gate-level simulations, using 45 nm libraries, of a pipelined adder-multiplier and DLX processor show that both our techniques achieve near 100 percent fault coverage. For DLX processor, even under severe fault injection campaigns, SEM achieves an average performance improvement of 26.58 percent over a conventional triple modular redundancy voter-based soft error mitigation scheme, while STEM outperforms SEM by 27.42 percent.

[1]  Babak Falsafi,et al.  Reunion: Complexity-Effective Multicore Redundancy , 2006, 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06).

[2]  S. Narendra,et al.  Forward body bias for microprocessors in 130nm technology generation and beyond , 2002, 2002 Symposium on VLSI Circuits. Digest of Technical Papers (Cat. No.02CH37302).

[3]  M. Nicolaidis,et al.  Design for soft error mitigation , 2005, IEEE Transactions on Device and Materials Reliability.

[4]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..

[5]  Sanjay Pant,et al.  A self-tuning DVS processor using delay-error detection and correction , 2005, IEEE Journal of Solid-State Circuits.

[6]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[7]  C. L. Chen,et al.  Symbol Error-Correcting Codes for Computer Memory Systems , 1992, IEEE Trans. Computers.

[8]  Arun K. Somani,et al.  REESE: a method of soft error detection in microprocessors , 2001, 2001 International Conference on Dependable Systems and Networks.

[9]  Robert P. Colwell The Zen of overclocking , 2004, Computer.

[10]  Muhammad Ashraful Alam,et al.  Reliability- and Process-variation aware design of integrated circuits — A broader perspective , 2008, 2011 International Reliability Physics Symposium.

[11]  Ming Zhang,et al.  Soft Error Resilient System Design through Error Correction , 2006, VLSI-SoC.

[12]  Paul D. Franzon,et al.  FreePDK: An Open-Source Variation-Aware Design Kit , 2007, 2007 IEEE International Conference on Microelectronic Systems Education (MSE'07).

[13]  Mingjing Chen,et al.  Improving Circuit Robustness with Cost-Effective Soft-Error-Tolerant Sequential Elements , 2007, 16th Asian Test Symposium (ATS 2007).

[14]  Todd M. Austin,et al.  DIVA: a reliable substrate for deep submicron microarchitecture design , 1999, MICRO-32. Proceedings of the 32nd Annual ACM/IEEE International Symposium on Microarchitecture.

[15]  P. Eaton,et al.  Soft error rate mitigation techniques for modern microcircuits , 2002, 2002 IEEE International Reliability Physics Symposium. Proceedings. 40th Annual (Cat. No.02CH37320).

[16]  E. Normand Single event upset at ground level , 1996 .

[17]  Viswanathan Subramanian,et al.  Superscalar Processor Performance Enhancement through Reliable Dynamic Clock Frequency Tuning , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[18]  Huazhong Yang,et al.  Power optimized digitally programmable delay element , 2009 .

[19]  Hideo Ito,et al.  Soft Error Hardened FF Capable of Detecting Wide Error Pulse , 2008, 2008 IEEE International Symposium on Defect and Fault Tolerance of VLSI Systems.

[20]  Gary S. Tyson,et al.  Guaranteeing Hits to Improve the Efficiency of a Small Instruction Cache , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[21]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[22]  Jiri Gaisler A portable and fault-tolerant microprocessor based on the SPARC v8 architecture , 2002, Proceedings International Conference on Dependable Systems and Networks.

[23]  Mikel Anton Bezdek,et al.  Utilizing timing error detection and recovery to dynamically improve superscalar processor performance , 2006 .

[24]  Lorena Anghel,et al.  Cost reduction and evaluation of temporary faults detecting technique , 2000, DATE '00.

[25]  Augustus K. Uht,et al.  Uniprocessor performance enhancement through adaptive clock frequency control , 2005, IEEE Transactions on Computers.

[26]  B. Narasimham,et al.  Characterization of Digital Single Event Transient Pulse-Widths in 130-nm and 90-nm CMOS Technologies , 2007, IEEE Transactions on Nuclear Science.

[27]  D. Blaauw,et al.  Opportunities and challenges for better than worst-case design , 2005, Proceedings of the ASP-DAC 2005. Asia and South Pacific Design Automation Conference, 2005..

[28]  Viswanathan Subramanian,et al.  Low overhead Soft Error Mitigation techniques for high-performance and aggressive systems , 2009, 2009 IEEE/IFIP International Conference on Dependable Systems & Networks.

[29]  Albert Meixner,et al.  Argus: Low-Cost, Comprehensive Error Detection in Simple Cores , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[30]  Keith A. Bowman,et al.  Impact of die-to-die and within-die parameter variations on the throughput distribution of multi-core processors , 2007, Proceedings of the 2007 international symposium on Low power electronics and design (ISLPED '07).

[31]  Josep Torrellas,et al.  Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).