Hardware Versus Software Fault Injection of Modern Undervolted SRAMs

To improve power efficiency, researchers are experimenting with dynamically lowering the supply voltage of systems below the nominal operating point. However, production systems are typically not allowed to operate at voltage settings below the reliable limit. Consequently, existing software fault tolerance studies rely on fault models that inject faults at random locations. In this work we study whether random fault injection accurately simulates the behavior of undervolted SRAMs. Our study extends the gem5 simulator to support fault injection in the caches of the simulated system. The fault injection framework takes fault maps, which describe the faulty bits of SRAMs, as inputs. To compare random fault injection with hardware-guided fault injection, we use two types of fault maps. The first type is created by undervolting real SRAMs and recording the locations of the erroneous bits, whereas the second type is created by corrupting random bits of the SRAMs. In our study we corrupt the L1 data cache of the simulated system and measure the effect of each type of fault map on the resiliency of six benchmarks. The resiliency of a benchmark can differ by up to 24% depending on which type of fault map is used.
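As an illustration of the fault-map idea, the following minimal Python sketch represents a fault map as a set of (cache line, bit offset) pairs. It is not the gem5 extension described above: the function names, the map representation, and the bit-flip fault model are assumptions made for this example. A hardware-guided map would use the same representation but be filled from locations observed on an undervolted SRAM instead of being sampled at random.

```python
import random

def random_fault_map(num_lines, bits_per_line, num_faults, seed=0):
    """Hypothetical helper: pick num_faults distinct (line, bit) locations
    uniformly at random over the whole array."""
    rng = random.Random(seed)
    total_bits = num_lines * bits_per_line
    flat = rng.sample(range(total_bits), num_faults)
    return {(i // bits_per_line, i % bits_per_line) for i in flat}

def apply_fault_map(line_index, data, fault_map):
    """Hypothetical helper: return a copy of `data` (the bytes of one cache
    line) with every bit marked faulty for that line flipped.
    Flipping is an assumed fault model; stuck-at faults are another option."""
    corrupted = bytearray(data)
    for line, bit in fault_map:
        if line == line_index:
            corrupted[bit // 8] ^= 1 << (bit % 8)
    return bytes(corrupted)

# Random map for an assumed 32 KiB cache with 64-byte (512-bit) lines.
random_map = random_fault_map(num_lines=512, bits_per_line=512, num_faults=100)

# Hand-written stand-in for a hardware-derived map: two faulty bits in line 0.
hardware_map = {(0, 3), (0, 9)}
faulty_line = apply_fault_map(0, bytes(64), hardware_map)
```

Because both kinds of map share one representation, the same injection routine can be driven by either, which is what makes a like-for-like comparison possible.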
