Error-Resilient Design Techniques for Reliable and Dependable Computing

Integrated circuits in modern systems-on-chip and microprocessors are typically operated with timing margins large enough to absorb the growing process, voltage, and temperature (PVT) variations at advanced process nodes. The widening margins needed to guarantee robust computation inevitably lead to conservative designs with unacceptable energy overheads. Reconciling variation mitigation with energy-efficient computing will require fundamental departures from conventional circuit and system design practices. This paper posits error-resilient general-purpose computing as an effective way to achieve this. We review resilient techniques that exploit tolerance to timing errors to compensate automatically for variations and to tune a system dynamically to its most efficient operating point, and we present the Razor approach as a pioneering example. We report silicon measurement results from multiple industrial and academic demonstration systems that employ Razor-based dynamic voltage and frequency management, highlighting two platforms in particular. The first is an ARM-based industrial prototype in which Razor dynamic adaptation yields 52% energy savings at 1 GHz operation. The second applies Razor for robust operation in the presence of radiation-induced single-event upsets. These efforts demonstrate how energy-efficient compute engines can be designed by combining timing-error resiliency with optimizations across algorithm, circuit, and microarchitecture boundaries.
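
The idea of tuning a system to its most efficient operating point by tolerating occasional timing errors can be illustrated with a small simulation. The sketch below is not the controller used in any of the cited silicon; it assumes a toy error model (no errors at or above a hypothetical critical voltage, a linearly rising error rate below it) and a simple step-up/step-down policy. All names and parameter values (`error_rate`, `tune_voltage`, the 900 mV critical point, the 10 mV step) are illustrative assumptions.

```python
def error_rate(vdd_mv, v_crit_mv=900, rate_per_mv=1e-4):
    """Hypothetical timing-error rate: zero at or above v_crit_mv,
    rising linearly for each millivolt below it."""
    if vdd_mv >= v_crit_mv:
        return 0.0
    return (v_crit_mv - vdd_mv) * rate_per_mv


def tune_voltage(vdd_mv=1100, target_rate=5e-4, step_mv=10,
                 v_min_mv=700, v_max_mv=1200, iterations=100):
    """Lower the supply while the observed error rate stays at or below
    the target, and raise it when the target is exceeded.
    Returns the supply voltage (in mV) after each iteration."""
    history = []
    for _ in range(iterations):
        rate = error_rate(vdd_mv)  # stands in for an error counter readout
        if rate > target_rate:
            vdd_mv = min(v_max_mv, vdd_mv + step_mv)  # too many errors: back off
        else:
            vdd_mv = max(v_min_mv, vdd_mv - step_mv)  # margin left: scale down
        history.append(vdd_mv)
    return history


if __name__ == "__main__":
    trace = tune_voltage()
    print("final supply (mV):", trace[-1])
    print("lowest supply reached (mV):", min(trace))
```

Under these assumptions the loop walks the supply down from its margined starting point and then hovers around the point where errors first appear, instead of holding a fixed worst-case margin. A real design would also need in-situ error detection and a recovery mechanism so that each detected error is corrected rather than merely counted.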
