Multi-Bit Upsets Vulnerability Analysis of Modern Microprocessors

Miniaturization of integrated circuits places more devices (and thus more functionality) on the same silicon area, but also makes them more vulnerable to soft (transient) errors. Assessing and understanding the magnitude of a microprocessor's vulnerability to soft errors in the early stages of design can steer wise, cost-effective protection decisions at the hardware or software level. In recent fabrication technologies, the effect of radiation (neutrons or other particles) on silicon devices is significantly more severe and leads to increased numbers of multi-bit upsets. In this paper, we analyze the effects of multi-bit upsets in modern microprocessors using microarchitecture-level fault injection and a complete system stack. We present details about the effects of multi-bit upsets on 6 major hardware components of an ARM Cortex-A9 CPU modeled on the Gem5 microarchitectural simulator, with 15 workloads across 8 fabrication technology nodes. For our analysis, we employ and extend the GeFIN (Gem5-based Fault INjector) framework to model and analyze multi-bit faults in the hardware structures of the CPU. The enhanced version of the fault injector models multi-bit faults in adjacent areas of a structure, a very realistic case when modern silicon chips are affected by radiation. Our analysis shows that the architectural vulnerability factor (AVF) increases significantly, by a factor of 1.5x (+50%) to 3.2x (+220%) depending on the component, when moving from single-bit to triple-bit faults. We present the aggregate multi-bit AVF of each hardware structure and each technology node from 250nm to 22nm. Our results show a significant AVF difference between single-bit and aggregate multi-bit measurements, of up to 35% as the technology node shrinks. This reveals the magnitude of the assessment gap when any method considers only single-bit errors. We report soft error Failures in Time (FIT) rates for the entire ARM Cortex-A9 CPU across technology nodes. Our results show that the contribution of multi-bit upsets to the overall CPU FIT increases consistently across technologies and reaches 21% at the 22nm node.
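
To make the fault model and the AVF measurement concrete, the following is a minimal Python sketch and not the GeFIN implementation. It assumes a hardware structure can be viewed as a two-dimensional bit array, that "adjacent" means contiguous cells in one row, that each injection run is classified with an outcome label, and that an aggregate AVF is a probability-weighted sum over fault cardinalities. The function names, the outcome labels, and all numeric inputs are hypothetical and chosen only for illustration.

```python
import random


def adjacent_fault_mask(rows, cols, n_bits, rng):
    """Pick n_bits contiguous cells in one row of a rows x cols bit array.

    Assumption: adjacency is modeled within a single row only.
    """
    if not 1 <= n_bits <= cols:
        raise ValueError("n_bits must fit within one row")
    row = rng.randrange(rows)
    start = rng.randrange(cols - n_bits + 1)
    return [(row, start + i) for i in range(n_bits)]


def inject(bits, mask):
    """Return a copy of the bit array with every masked cell flipped."""
    faulty = [row[:] for row in bits]
    for r, c in mask:
        faulty[r][c] ^= 1
    return faulty


def avf(outcomes):
    """Estimate AVF as the fraction of injections that were not masked.

    Assumption: each run is labeled "masked" or with some failure class.
    """
    if not outcomes:
        raise ValueError("at least one injection outcome is required")
    return sum(o != "masked" for o in outcomes) / len(outcomes)


def aggregate_avf(avf_by_cardinality, prob_by_cardinality):
    """Weight the per-cardinality AVF by the probability of that cardinality.

    Assumption: the probabilities describe how often an upset flips
    1, 2, 3, ... bits at a given technology node and sum to 1.
    """
    return sum(avf_by_cardinality[k] * prob_by_cardinality[k]
               for k in avf_by_cardinality)


if __name__ == "__main__":
    rng = random.Random(0)
    structure = [[0] * 8 for _ in range(4)]        # hypothetical 4x8 structure
    mask = adjacent_fault_mask(4, 8, 3, rng)       # one triple-bit fault
    faulty = inject(structure, mask)
    print("flipped cells:", mask)
    print("AVF estimate:", avf(["masked", "sdc", "crash", "masked"]))
```

In a real campaign the outcome labels would come from running each workload to completion on the simulator after the flip, and the cardinality probabilities would come from technology-specific radiation data rather than from placeholder values.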
