Flexible Error Protection for Energy Efficient Reliable Architectures

Technology scaling is having an increasingly detrimental effect on microprocessor reliability, with increased variability and higher susceptibility to errors. At the same time, as integration of chip multiprocessors increases, power consumption is becoming a significant bottleneck that could threaten their growth. To deal with these competing trends, energy-efficient solutions are needed to deal with reliability problems. This paper presents a reliable multicore architecture that provides targeted error protection by adapting to the characteristics of individual cores and workloads, with the goal of providing reliability with minimum energy. The user can specify an acceptable reliability target for each chip, core, or application. The system then adjusts a range of parameters, including replication and supply voltage, to meet that reliability goal. In this multicore architecture, each core consists of a pair of pipelines that can run independently (running separate threads) or in concert (running the same thread and verifying results). Redundancy is enabled selectively, at functional unit granularity. The architecture also employs timing speculation for mitigation of variation-induced timing errors and to reduce the power overhead of error protection. On-line control based on machine learning dynamically adjusts multiple parameters to minimize energy consumption. Evaluation shows that dynamic adaptation of voltage and redundancy can reduce the energy delay product of a CMP by 30 − 60% compared to static dual modular redundancy.

[1]  Kevin Skadron,et al.  Temperature-aware microarchitecture , 2003, ISCA '03.

[2]  Malgorzata Marek-Sadowska,et al.  Power gating scheduling for power/ground noise reduction , 2008, 2008 45th ACM/IEEE Design Automation Conference.

[3]  Gu-Yeon Wei,et al.  Revival: A Variation-Tolerant Architecture Using Voltage Interpolation and Variable Latency , 2009, IEEE Micro.

[4]  Amin Ansari,et al.  The StageNet fabric for constructing resilient multicore systems , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[5]  Josep Torrellas,et al.  Paceline: Improving Single-Thread Performance in Nanoscale CMPs through Core Overclocking , 2007, 16th International Conference on Parallel Architecture and Compilation Techniques (PACT 2007).

[6]  Ming Zhang,et al.  Combinational Logic Soft Error Correction , 2006, 2006 IEEE International Test Conference.

[7]  Josep Torrellas,et al.  EVAL: Utilizing processors with variation-induced timing errors , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[8]  J. Tschanz,et al.  Neutron soft error rate measurements in a 90-nm CMOS process and scaling trends in SRAM from 0.25-/spl mu/m to 90-nm generation , 2003, IEEE International Electron Devices Meeting 2003.

[9]  Daniel J. Sorin,et al.  Core Cannibalization Architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults , 2008, 2008 International Conference on Parallel Architectures and Compilation Techniques (PACT).

[10]  Onur Mutlu,et al.  Online design bug detection: RTL analysis, flexible mechanisms, and evaluation , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[11]  J. Torrellas,et al.  VARIUS: A Model of Parameter Variation and Resulting Timing Errors for Microarchitects , 2007 .

[12]  Josep Torrellas,et al.  Uncorq: Unconstrained Snoop Request Delivery in Embedded-Ring Multiprocessors , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[13]  Shubhendu S. Mukherjee,et al.  APast Future Time Quantized AVF : A Means of Capturing Vulnerability Variations over Small Windows of Time , 2009 .

[14]  Josep Torrellas,et al.  Mitigating Parameter Variation with Dynamic Fine-Grain Body Biasing , 2007, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007).

[15]  Sudhanva Gurumurthi,et al.  Dynamic prediction of architectural vulnerability from microarchitectural state , 2007, ISCA '07.

[16]  Lorenzo Alvisi,et al.  Modeling the effect of technology trends on the soft error rate of combinational logic , 2002, Proceedings International Conference on Dependable Systems and Networks.

[17]  J. Torrellas,et al.  VARIUS: A Model of Process Variation and Resulting Timing Errors for Microarchitects , 2008, IEEE Transactions on Semiconductor Manufacturing.

[18]  C. Pipper,et al.  [''R"--project for statistical computing]. , 2008, Ugeskrift for laeger.

[19]  R. Kumar,et al.  An Integrated Quad-Core Opteron Processor , 2007, 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers.

[20]  James E. Smith,et al.  Configurable isolation: building high availability systems with commodity multi-core processors , 2007, ISCA '07.

[21]  Xiaodong Li,et al.  SoftArch: an architecture-level tool for modeling and analyzing soft errors , 2005, 2005 International Conference on Dependable Systems and Networks (DSN'05).

[22]  James Tschanz,et al.  Parameter variations and impact on circuits and microarchitecture , 2003, Proceedings 2003. Design Automation Conference (IEEE Cat. No.03CH37451).

[23]  Meeta Sharma Gupta,et al.  System level analysis of fast, per-core DVFS using on-chip switching regulators , 2008, 2008 IEEE 14th International Symposium on High Performance Computer Architecture.

[24]  Daniel A. Jiménez,et al.  Low-power, high-performance analog neural branch prediction , 2008, 2008 41st IEEE/ACM International Symposium on Microarchitecture.

[25]  Todd M. Austin,et al.  A fault tolerant approach to microprocessor design , 2001, 2001 International Conference on Dependable Systems and Networks.

[26]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[27]  S. Naffziger,et al.  Power and temperature control on a 90-nm Itanium family processor , 2006, IEEE Journal of Solid-State Circuits.

[28]  Timothy J. Slegel,et al.  IBM's S/390 G5 microprocessor design , 1999, IEEE Micro.

[29]  Onur Mutlu,et al.  Self-Optimizing Memory Controllers: A Reinforcement Learning Approach , 2008, 2008 International Symposium on Computer Architecture.

[30]  Trevor Mudge,et al.  Razor: a low-power pipeline based on circuit-level timing speculation , 2003, Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36..