Analyzing the Sensitivity of GPU Pipeline Registers to Single Event Upsets

Graphics processing units (GPUs) are a viable solution for high-performance safety-critical applications, such as self-driving cars. In this domain, functional safety and reliability are major concerns, and fault-tolerance techniques are mandatory to detect or correct faults, because these devices must operate correctly even when faults are present. GPUs are manufactured with cutting-edge technologies, which makes them sensitive to radiation-induced faults such as single event upsets (SEUs). These effects can lead to system failure, which is unacceptable in safety-critical applications, so effective detection and mitigation strategies are needed to harden GPU operation. In this paper, we analyze the effects of transient faults in the pipeline registers of a GPU architecture. We run four applications on three GPU configurations and consider the source of the fault, its effect on the GPU, and the use of software-based hardening techniques. The evaluation was performed on a general-purpose soft-core GPU based on the NVIDIA G80 architecture. The results can guide designers in building more resilient GPU architectures.
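As a rough illustration of the kind of experiment summarized above, the following Python sketch models a single event upset as one bit flip in an intermediate value, classifies the outcome against a fault-free run, and applies a simple duplication-with-comparison check as one example of software-based hardening. This is not the paper's fault-injection framework: the 32-bit register width, the toy kernel `scaled_axpy`, the outcome labels "masked" and "SDC" (silent data corruption), and all function names are assumptions made only for this example.

```python
MASK32 = 0xFFFFFFFF  # assumed 32-bit register width (hypothetical)


def inject_seu(value, bit):
    """Flip a single bit of a 32-bit value, modeling a single event upset."""
    return (value ^ (1 << bit)) & MASK32


def scaled_axpy(a, x, y, fault=None):
    """Toy kernel (hypothetical): out[i] = ((a * x[i]) >> 4) + y[i].

    `fault` is an optional (element_index, bit) pair. When given, one bit of
    the intermediate product for that element is flipped before it is used,
    standing in for an upset in a register between two pipeline stages.
    """
    out = []
    for i, (xi, yi) in enumerate(zip(x, y)):
        prod = (a * xi) & MASK32
        if fault is not None and fault[0] == i:
            prod = inject_seu(prod, fault[1])
        out.append(((prod >> 4) + yi) & MASK32)
    return out


def classify(golden, observed):
    """Compare a faulty run with the fault-free (golden) run."""
    return "masked" if golden == observed else "SDC"


def run_with_dwc(a, x, y, fault=None):
    """Duplication with comparison: execute twice and compare the results.

    The fault is assumed to hit only the first execution, since a transient
    fault is unlikely to strike both copies in the same way.
    """
    first = scaled_axpy(a, x, y, fault)
    second = scaled_axpy(a, x, y)
    return first, first != second  # (result, mismatch detected?)


if __name__ == "__main__":
    a, x, y = 3, [1, 2, 3], [10, 20, 30]
    golden = scaled_axpy(a, x, y)
    for bit in (2, 5):
        faulty = scaled_axpy(a, x, y, fault=(1, bit))
        _, detected = run_with_dwc(a, x, y, fault=(1, bit))
        print(bit, classify(golden, faulty), detected)
```

In this toy kernel, flips in the four low-order bits of the product are discarded by the shift and never reach the output, while flips in higher bits corrupt the result and are caught by the duplicated execution. A real campaign would inject into the hardware model's pipeline registers rather than into a software variable.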
