Multiple-Bit Upset Protection in Microprocessor Memory Arrays Using Vulnerability-Based Parity Optimization and Interleaving

We propose a technology-independent, vulnerability-driven parity selection method for protecting modern microprocessor in-core memory arrays against multiple-bit upsets (MBUs). As MBUs constitute over 50% of the upsets in recent technologies, error-correcting codes or physical interleaving are typically employed to protect out-of-core memory structures, such as caches. Such methods, however, are not applicable to high-performance in-core arrays because of their computational complexity, high delay, and area overhead. Therefore, we investigate vulnerability-based parity forest formation as an effective mechanism for detecting errors; checkpointing and pipeline flushing can subsequently be used for correction. As optimal parity tree construction for MBU detection is a computationally complex problem, an integer linear program formulation is introduced. In addition, vulnerability-based interleaving (VBI) is explored as a mechanism for further enhancing in-core array resiliency in constrained, single-parity-tree cases. VBI first physically disperses bitlines based on their vulnerability factor and then applies selective parity to these lines. Experimental results on Alpha 21264 and Intel P6 in-core memory arrays demonstrate that the proposed parity tree selection and VBI methods can reduce vulnerability by up to 86%, even when only a small number of bits are added to the parity trees.
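
To make the idea of vulnerability-driven parity selection concrete, the following is a minimal sketch rather than the formulation used in the paper. It assumes a toy fault model (single-bit upsets plus double-bit upsets on physically adjacent cells, each weighted by a hypothetical per-bit vulnerability factor) and replaces the integer linear program with an exhaustive search, which is only practical for very small arrays. All names (`build_faults`, `detected_weight`, `best_parity_tree`), the probabilities, and the pair-weighting rule are illustrative assumptions. The key property it shows is that a single parity tree detects a fault only if an odd number of the flipped bits feed the tree.

```python
from itertools import combinations


def build_faults(vuln, order, p_single=0.5, p_double=0.5):
    """Enumerate weighted fault patterns for a toy array.

    vuln[b]  : assumed vulnerability factor of logical bit b
    order[i] : logical bit stored at physical position i
    Assumption: a double-bit upset hits two physically adjacent cells and
    is weighted by the larger of the two vulnerability factors.
    """
    faults = [((bit,), p_single * vuln[bit]) for bit in order]
    for pos in range(len(order) - 1):
        a, b = order[pos], order[pos + 1]
        faults.append(((a, b), p_double * max(vuln[a], vuln[b])))
    return faults


def detected_weight(parity_bits, faults):
    """Total weight of faults that flip an odd number of parity-covered bits."""
    covered = set(parity_bits)
    return sum(w for bits, w in faults if len(covered & set(bits)) % 2 == 1)


def best_parity_tree(n_bits, faults, budget):
    """Exhaustively pick `budget` bits for one parity tree (stand-in for the ILP)."""
    best = max(combinations(range(n_bits), budget),
               key=lambda sel: detected_weight(sel, faults))
    return best, detected_weight(best, faults)


if __name__ == "__main__":
    vuln = [0.9, 0.8, 0.1, 0.1]           # hypothetical vulnerability factors
    faults = build_faults(vuln, order=[0, 1, 2, 3])
    selection, weight = best_parity_tree(len(vuln), faults, budget=2)
    print("parity bits:", selection, "detected weight:", round(weight, 3))
```

In this toy model, choosing the two most vulnerable bits is not necessarily best when they are physically adjacent, because a double-bit upset hitting both leaves the parity unchanged. The `order` argument can be permuted to explore how a different physical placement of bitlines changes the detected weight, in the spirit of VBI.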
