Evaluating the soft error sensitivity of a GPU-based SoC for matrix multiplication

Abstract. System-on-Chip (SoC) devices can be composed of low-power multicore processors combined with a small graphics accelerator (or GPU), which offers a trade-off between computational capacity and low power consumption. In this work, we use the LLFI-GPU fault injection tool on one of these devices to compare the sensitivity to soft errors of two different CUDA versions of a matrix multiplication benchmark. Specifically, we perform fault injection campaigns on a Jetson TK1 development kit, a board equipped with a SoC that includes an NVIDIA “Kepler” Graphics Processing Unit (GPU). We evaluate how the problem size and the thread-block size affect the behaviour of the algorithms. Our results show that the block version of the matrix multiplication benchmark, which leverages the shared memory of the GPU, is not only faster than the element-wise version but also much more resilient to soft errors. We also use the cuda-gdb debugger to analyze the main causes of the crashes that soft errors produce in the code. Our experiments show that most of the errors are due to accesses to invalid positions of the different memories of the GPU, which causes the block version to suffer a higher percentage of this kind of error.
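To make the two variants compared in the abstract concrete, the following is a minimal CUDA sketch of an element-wise kernel and a block (tiled, shared-memory) kernel. It is an illustration only, not the benchmark code used in the study: the kernel names, the `TILE` value, the input pattern, and the helper `maxAbsDiffBetweenKernels` are all assumptions introduced here, and error checking of the CUDA calls is omitted for brevity.

```cuda
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define TILE 16  // thread-block edge; hypothetical value chosen for this sketch

// Element-wise version: each thread computes one element of C and reads
// its row of A and its column of B directly from global memory.
__global__ void matmulElementwise(const float *A, const float *B, float *C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n || col >= n) return;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k)
        acc += A[row * n + k] * B[k * n + col];
    C[row * n + col] = acc;
}

// Block version: each thread block stages TILE x TILE tiles of A and B in
// shared memory and accumulates partial products from those tiles.
// Must be launched with TILE x TILE threads per block.
__global__ void matmulBlocked(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Pad with zeros when the tile extends past the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tile fully loaded before anyone reads it
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done before the tile is overwritten
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical helper: runs both kernels on the same n x n inputs and
// returns the largest absolute difference between their outputs.
float maxAbsDiffBetweenKernels(int n) {
    size_t bytes = (size_t)n * n * sizeof(float);
    std::vector<float> hA(n * n), hB(n * n), hC1(n * n), hC2(n * n);
    for (int i = 0; i < n * n; ++i) {
        hA[i] = (i % 7) * 0.5f;         // arbitrary deterministic inputs
        hB[i] = (float)(i % 5) - 2.0f;
    }
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulElementwise<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC1.data(), dC, bytes, cudaMemcpyDeviceToHost);  // blocking copy
    matmulBlocked<<<grid, block>>>(dA, dB, dC, n);
    cudaMemcpy(hC2.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    float maxDiff = 0.0f;
    for (int i = 0; i < n * n; ++i)
        maxDiff = fmaxf(maxDiff, fabsf(hC1[i] - hC2[i]));
    return maxDiff;
}

int main() {
    // n is deliberately not a multiple of TILE to exercise the edge handling.
    printf("max |C_elementwise - C_blocked| = %g\n", maxAbsDiffBetweenKernels(100));
    return 0;
}
```

In both kernels every load and store goes through computed indices such as `row * n + k`, so a bit flip in an index or pointer value can plausibly redirect an access outside the allocated buffers. The block version additionally indexes shared memory, which gives it one more address space in which such an invalid access can occur.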
