Error Tolerance in Server Class Processors

This paper provides: 1) a brief motivation, with technology trend data, showing why hard and soft errors are expected to be of increasing concern in the future; 2) a summary review of current chip-level error tolerance practices, with a brief reference to IBM's POWER6 and POWER7 designs; 3) open research challenges and promising current solution approaches, based on published literature; and 4) concluding remarks.
