Low Overhead Soft Error Mitigation Techniques for High-Performance and Aggressive Systems

The threat of soft-error-induced system failure in computing systems has become more prominent as we adopt ultradeep submicron process technologies. In this paper, we propose two efficient soft error mitigation schemes, Soft Error Mitigation (SEM) and Soft and Timing Error Mitigation (STEM), which protect combinational logic blocks from soft errors by clocking data multiple times. Our first technique, SEM, is based on distributed, temporal voting across three registers and removes the soft error detection overhead from the critical path of the system. SEM can also ignore false errors, and it recovers from soft errors through fast in-situ recovery, avoiding recomputation. Our second technique, STEM, tolerates soft errors and adds timing error detection to guarantee reliable execution in aggressively clocked designs, which enhance system performance by operating beyond the worst-case clock frequency. We also present a specialized, low-overhead clock phase management scheme that supports both techniques. Timing-annotated gate-level simulations of a pipelined adder-multiplier and a DLX processor, using 45 nm libraries, show that both techniques achieve near 100 percent fault coverage. For the DLX processor, even under severe fault injection campaigns, SEM achieves an average performance improvement of 26.58 percent over a conventional triple modular redundancy voter-based soft error mitigation scheme, while STEM outperforms SEM by 27.42 percent.
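The idea of voting across three registers that sample the same data at different times can be illustrated with a small behavioral sketch. This is not the circuit proposed in the paper: it is a generic bitwise majority vote, and the names `majority3` and `vote`, the 8-bit width, and the example values are assumptions made for illustration only.

```python
def majority3(r1: int, r2: int, r3: int) -> int:
    """Bitwise majority of three register values (generic TMR-style vote)."""
    return (r1 & r2) | (r1 & r3) | (r2 & r3)


def vote(samples, width=8):
    """Return (voted_value, disagreement_flag) for three samples.

    `samples` stands for the same combinational output captured by three
    registers at three different clock phases (hypothetical model).
    """
    r1, r2, r3 = samples
    mask = (1 << width) - 1
    voted = majority3(r1, r2, r3) & mask
    disagree = not (r1 == r2 == r3)
    return voted, disagree


# A transient flips one bit in only the second sample; the vote masks it.
clean = 0b10110010
glitched = clean ^ 0b00001000
print(vote((clean, glitched, clean)))  # (178, True)
```

Because a transient pulse is short, it is assumed here to corrupt at most one of the three temporally separated samples, so the majority value stays correct while the disagreement flag records that a mismatch occurred.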
