Calculating Architectural Vulnerability Factors for Spatial Multi-Bit Transient Faults

Reliability is an important design constraint in modern microprocessors, and one of the fundamental reliability challenges is combating the effects of transient faults. This requires extensive analysis, including significant fault modelling allow architects to make informed reliability tradeoffs. Recent data shows that multi-bit transient faults are becoming more common, increasing from 0.5% of static random-access memory (SRAM) faults in 180nm to 3.9% in 22nm. Such faults are predicted to be even more prevalent in smaller technology nodes. Therefore, accurately modeling the effects of multi-bit transient faults is increasingly important to the microprocessor design process. Architecture vulnerability factor (AVF) analysis is a method to model the effects of single-bit transient faults. In this paper, we propose a method to calculate AVFs for spatial multibittransient faults (MB-AVFs) and provide insights that can help reduce the impact of these faults. First, we describe a novel multi-bit AVF analysis approach for detected uncorrected errors (DUEs) and show how to measure DUE MB-AVFs in a performance simulator. We then extend our approach to measure silent data corruption (SDC) MB-AVFs. We find that MB-AVFs are not derivable from single-bit AVFs. We also find that larger fault modes have higher MB-AVFs. Finally, we present a case study on using MB-AVF analysis to optimize processor design, yielding SDC reductions of 86% in a GPU vector register file.

[1]  E. Ibe,et al.  Impact of Scaling on Neutron-Induced Soft Error in SRAMs From a 250 nm to a 22 nm Design Rule , 2010, IEEE Transactions on Electron Devices.

[2]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[3]  Mehdi Baradaran Tahoori,et al.  A Field Analysis of System-level Effects of Soft Errors Occurring in Microprocessors used in Information Systems , 2008, 2008 IEEE International Test Conference.

[4]  Michael Allen Heroux Miniapplications: Vehicles for Co-design. , 2011 .

[5]  Arun K. Somani,et al.  Soft error sensitivity characterization for microprocessor dependability enhancement strategy , 2002, Proceedings International Conference on Dependable Systems and Networks.

[6]  R.C. Baumann,et al.  Radiation-induced soft errors in advanced semiconductor technologies , 2005, IEEE Transactions on Device and Materials Reliability.

[7]  John Lach,et al.  Bit-slice logic interleaving for spatial multi-bit soft-error tolerance , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[8]  Hannu Tenhunen,et al.  Switching Sensitive Driver Circuit to Combat Dynamic Delay in On-Chip Buses , 2005, PATMOS.

[9]  Xin Fu,et al.  Analyzing soft-error vulnerability on GPGPU microarchitecture , 2011, 2011 IEEE International Symposium on Workload Characterization (IISWC).

[10]  Janak H. Patel,et al.  Reliability of scrubbing recovery-techniques for memory systems , 1990 .

[11]  David R. Kaeli,et al.  Multi2Sim: A simulation framework for CPU-GPU computing , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[12]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[13]  Sudhanva Gurumurthi,et al.  Feng Shui of supercomputer memory positional effects in DRAM and SRAM faults , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[14]  Gabriel H. Loh,et al.  Architectural Vulnerability Modeling and Analysis of Integrated Graphics Processors , 2013 .

[15]  Chris Weller,et al.  Error injection-based study of soft error propagation in AMD Bulldozer microprocessor module , 2012, IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012).

[16]  David R. Kaeli,et al.  Using hardware vulnerability factors to enhance AVF analysis , 2010, ISCA.

[17]  R. Allmon,et al.  Soft Error Susceptibilities of 22 nm Tri-Gate Devices , 2012, IEEE Transactions on Nuclear Science.

[18]  Arijit Biswas,et al.  Computing Accurate AVFs using ACE Analysis on Performance Models: A Rebuttal , 2008, IEEE Computer Architecture Letters.

[19]  Xiaodong Li,et al.  Architecture-Level Soft Error Analysis: Examining the Limits of Common Assumptions , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[20]  Stijn Eyerman,et al.  A first-order mechanistic model for architectural vulnerability factor , 2012, 2012 39th Annual International Symposium on Computer Architecture (ISCA).

[21]  Doris Schmitt-Landsiedel,et al.  A Design Space Comparison of 6T and 8T SRAM Core-Cells , 2008, PATMOS.

[22]  David R. Kaeli,et al.  Eliminating microarchitectural dependency from Architectural Vulnerability , 2009, 2009 IEEE 15th International Symposium on High Performance Computer Architecture.

[23]  Alan B. Williams,et al.  Poster: mini-applications: vehicles for co-design , 2011, SC '11 Companion.

[24]  Shubhendu S. Mukherjee,et al.  APast Future Time Quantized AVF : A Means of Capturing Vulnerability Variations over Small Windows of Time , 2009 .

[25]  Arijit Biswas,et al.  Computing architectural vulnerability factors for address-based structures , 2005, 32nd International Symposium on Computer Architecture (ISCA'05).

[26]  Michel Dubois,et al.  MACAU: A Markov model for reliability evaluations of caches under Single-bit and Multi-bit Upsets , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[27]  Scott A. Mahlke,et al.  Runtime asynchronous fault tolerance via speculation , 2012, CGO '12.

[28]  Somayeh Sardashti,et al.  The gem5 simulator , 2011, CARN.

[29]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[30]  Philip Koopman,et al.  Cyclic redundancy code (CRC) polynomial selection for embedded networks , 2004, International Conference on Dependable Systems and Networks, 2004.

[31]  Sanjay J. Patel,et al.  Examining ACE analysis reliability estimates using fault-injection , 2007, ISCA '07.

[32]  Joel S. Emer,et al.  Techniques to reduce the soft error rate of a high-performance microprocessor , 2004, Proceedings. 31st Annual International Symposium on Computer Architecture, 2004..

[33]  Vijay S. Pande,et al.  Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU , 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.

[34]  J. Maiz,et al.  Characterization of multi-bit soft error events in advanced SRAMs , 2003, IEEE International Electron Devices Meeting 2003.

[35]  John Sartori,et al.  Low-power, low-storage-overhead chipkill correct via multi-line error correction , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[36]  E. Macii,et al.  Power and timing modelling, optimisation and simulation , 2005 .

[37]  Sudhanva Gurumurthi,et al.  Dynamic prediction of architectural vulnerability from microarchitectural state , 2007, ISCA '07.

[38]  E. Normand Single event upset at ground level , 1996 .

[39]  Charles Slayman,et al.  Soft error trends and mitigation techniques in memory devices , 2011, 2011 Proceedings - Annual Reliability and Maintainability Symposium.

[40]  Kevin Skadron,et al.  Evaluating Overheads of Multibit Soft-Error Protection in the Processor Core , 2013, IEEE Micro.

[41]  John Lach,et al.  Transient fault models and AVF estimation revisited , 2010, 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN).

[42]  M. Annavaram,et al.  Soft error benchmarking of L2 caches with PARMA , 2011, PERV.