Exploration of TMR fault masking with persistent threads on Tegra GPU SoCs

Low-power, high-performance system-on-chip (SoC) devices, such as the NVIDIA Tegra K1 and Tegra X1, have many potential uses in aerospace applications. Fusing ARM CPUs with a large GPU, Tegra SoCs are well suited for image and signal processing. However, fault masking and fault tolerance on GPUs are relatively unexplored for harsh environments. With hundreds of GPU cores, a complex caching structure, and a custom task scheduler, Tegra SoCs are vulnerable to a wide range of single-event upsets (SEUs). Triple-modular redundancy (TMR) provides a strong basis for fault masking on a wide range of devices, but GPUs pose a unique challenge to a typical TMR implementation. NVIDIA's scheduler assigns tasks based on available resources, yet the scheduling process is not publicly documented. As a result, a malfunctioning core could be assigned the same block of code in each TMR module; in that case, a fault could go undetected and introduce an error into the resulting data. Likewise, an upset in the scheduler or cache could adversely affect data integrity. To mask and mitigate upsets in GPUs, we propose and investigate a new method that combines persistent threading and CUDA Streams with TMR. Persistent threading is an approach to GPU programming in which a kernel's threads run indefinitely; CUDA Streams enable multiple kernels to run concurrently on a single GPU. Combining these two programming paradigms, we remove the vulnerability to scheduler faults and ensure that each TMR instance executes concurrently on different cores, with each instance holding its own copy of the data. We evaluate our method with an experiment that applies a Sobel filter to a 640×480 image on an NVIDIA Tegra X1. To inject faults and verify our method, a separate task corrupts a memory location; with this simple injector, we can simulate an upset in a GPU core or memory location.
From this experiment, our results confirm that using persistent threading and CUDA Streams with TMR masks the simulated SEUs on the Tegra X1. Furthermore, we provide performance results to quantify the overhead of this new method.
