Modern GPUs Radiation Sensitivity Evaluation and Mitigation Through Duplication With Comparison

Graphics processing units (GPUs) are increasingly common in both safety-critical and high-performance computing (HPC) applications. Some current supercomputers are composed of thousands of GPUs, so the probability of device corruption becomes very high. Moreover, the parallel capabilities of GPUs are very attractive for the automotive and aerospace markets, where reliability is a serious concern. In this paper, the neutron sensitivity of modern GPU caches and internal resources is experimentally evaluated. Various Duplication With Comparison (DWC) strategies to reduce GPU radiation sensitivity are then presented and validated through radiation experiments. Threads should be duplicated carefully to avoid undesired errors on shared resources and to avoid exacerbating errors in critical resources such as the scheduler.
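
The basic idea behind Duplication With Comparison can be illustrated with a minimal host-side sketch in Python. It is not GPU code and not one of the strategies evaluated in the paper: the names `multiply_add` and `dwc_run`, and the `fault` parameter that emulates a single bit flip in one copy's output, are assumptions introduced only for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def multiply_add(a: int, x: Sequence[int], y: Sequence[int]) -> List[int]:
    """Hypothetical stand-in for a GPU kernel: element-wise a * x[i] + y[i]."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def dwc_run(
    kernel: Callable[..., List[int]],
    inputs: tuple,
    fault: Optional[Tuple[int, int]] = None,
) -> Tuple[List[int], List[int]]:
    """Run `kernel` twice on the same inputs and compare the two outputs.

    `fault` is an optional (index, bit) pair. When given, that bit is flipped
    in the first copy's output only, to emulate a radiation-induced upset.
    Returns the first copy's output and the indices where the copies disagree.
    """
    primary = kernel(*inputs)
    replica = kernel(*inputs)

    if fault is not None:
        index, bit = fault
        primary[index] ^= 1 << bit  # emulated single bit flip

    mismatches = [i for i, (p, r) in enumerate(zip(primary, replica)) if p != r]
    return primary, mismatches


if __name__ == "__main__":
    x, y = [1, 2, 3, 4], [10, 20, 30, 40]
    _, clean = dwc_run(multiply_add, (3, x, y))
    _, faulty = dwc_run(multiply_add, (3, x, y), fault=(2, 4))
    print("mismatches without fault:", clean)
    print("mismatches with fault:", faulty)
```

In this sketch the comparison only detects a disagreement between the two copies; it cannot tell which copy is correct, so recovery would require re-execution. The sketch also applies the fault after both copies have finished, which sidesteps the issue the abstract raises: on a real GPU the two copies may share caches or the scheduler, so a single upset in a shared resource could corrupt both copies identically and go undetected.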
