Evaluating the radiation sensitivity of GPGPU caches: New algorithms and experimental results

Abstract: Given their high computational power, General Purpose Graphics Processing Units (GPGPUs) are increasingly preferred to CPUs for computationally intensive applications, including many unrelated to computer graphics. However, their sensitivity to radiation has yet to be fully evaluated. In this context, GPGPU data caches and shared memory play a key role, since they increase performance by sharing data among the parallel resources of a GPGPU and by minimizing memory access overhead. In this paper we present three new algorithms designed to support radiation experiments that evaluate the radiation sensitivity of GPGPU data caches and shared memory. We also report the cross-section and Failure In Time (FIT) results from neutron testing experiments performed on a commercial-off-the-shelf GPGPU using the proposed algorithms, with particular emphasis on the shared memory and on the L1 and L2 data caches.
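For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how a cross-section and a FIT rate are commonly derived from an accelerated neutron test. It is not the authors' analysis code. The error count and fluence are hypothetical placeholders, not results from the paper, and the default flux of 13 neutrons/(cm^2 h) is the commonly used sea-level reference value, assumed here for illustration.

```python
def cross_section(observed_errors: int, fluence: float) -> float:
    """Return the cross-section in cm^2.

    observed_errors: number of errors counted during the exposure.
    fluence: total particles delivered per unit area (neutrons/cm^2).
    """
    if fluence <= 0:
        raise ValueError("fluence must be positive")
    return observed_errors / fluence


def fit_rate(sigma_cm2: float, flux_per_cm2_hour: float = 13.0) -> float:
    """Return the expected failures per 10^9 device-hours (FIT).

    flux_per_cm2_hour defaults to an assumed sea-level reference flux.
    """
    return sigma_cm2 * flux_per_cm2_hour * 1e9


# Hypothetical experiment values, chosen only to illustrate the arithmetic.
errors = 200          # errors observed in the memory under test
fluence = 1.0e10      # neutrons/cm^2 delivered during the run

sigma = cross_section(errors, fluence)
fit = fit_rate(sigma)
print(f"cross-section: {sigma:.3e} cm^2")
print(f"FIT at the assumed reference flux: {fit:.1f}")
```

Dividing the cross-section by the number of bits in the tested memory would give a per-bit figure, which makes caches of different sizes comparable.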
