Energy-efficient fault tolerance in chip multiprocessors using Critical Value Forwarding

Relentless CMOS scaling coupled with lower design tolerances is making ICs increasingly susceptible to wear-out-related permanent faults and transient faults, necessitating on-chip fault tolerance in future chip multiprocessors (CMPs). In this paper we introduce a new energy-efficient fault-tolerant CMP architecture known as Redundant Execution using Critical Value Forwarding (RECVF). RECVF is based on two observations: (i) forwarding critical instruction results from the leading core to the trailing core enables the latter to execute faster, and (ii) this speedup can be exploited to reduce energy consumption by operating the trailing core at a lower voltage-frequency level. Our evaluation shows that RECVF consumes 37% less energy than conventional dual modular redundant (DMR) execution of a program. It consumes only 1.26 times the energy of a non-fault-tolerant baseline and has a performance overhead of just 1.2%.
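The two energy figures quoted above can be cross-checked against each other. The short sketch below is not taken from the paper; it only combines the stated numbers (1.26 times the baseline energy, 37% less than DMR) to derive the DMR energy they imply relative to the baseline. The function name `implied_dmr_energy` is hypothetical and introduced here for illustration.

```python
def implied_dmr_energy(recvf_vs_baseline: float, saving_vs_dmr: float) -> float:
    """Return the DMR energy, in units of baseline energy, implied by two figures.

    recvf_vs_baseline: RECVF energy as a multiple of the non-fault-tolerant baseline.
    saving_vs_dmr: fractional energy saving of RECVF relative to DMR (0.37 for 37%).

    Derivation: E_recvf = (1 - saving_vs_dmr) * E_dmr, so
    E_dmr = E_recvf / (1 - saving_vs_dmr).
    """
    if not 0.0 <= saving_vs_dmr < 1.0:
        raise ValueError("saving_vs_dmr must be in [0, 1)")
    return recvf_vs_baseline / (1.0 - saving_vs_dmr)


if __name__ == "__main__":
    # Figures quoted in the abstract: 1.26x baseline energy, 37% less than DMR.
    dmr = implied_dmr_energy(1.26, 0.37)
    print(f"Implied DMR energy: {dmr:.2f}x baseline")
```

Under these figures the implied DMR energy comes out at roughly twice the baseline, which is what one would expect if conventional DMR runs the program on two cores at the same voltage-frequency level.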
