Optimizing Software-Directed Instruction Replication for GPU Error Detection

Application execution on safety-critical and high-performance computer systems must be resilient to transient errors. As GPUs become more pervasive in such systems, they must supplement ECC/parity for major storage structures with reliability techniques that cover more of the GPU hardware logic. Instruction duplication has been explored for CPU resilience; however, it has never been studied in the context of GPUs, and it is unclear whether the performance and design choices it presents make it a feasible GPU solution. This paper describes a practical methodology to employ instruction duplication for GPUs and identifies implementation challenges that can incur high overheads (69% on average). It explores GPU-specific software optimizations that trade fine-grained recoverability for performance. It also proposes simple ISA extensions with limited hardware changes and area costs to further improve performance, cutting the runtime overheads by more than half to an average of 30%.

[1]  Cristian Constantinescu,et al.  Trends and Challenges in VLSI Circuit Reliability , 2003, IEEE Micro.

[2]  Satoshi Matsuoka,et al.  NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA , 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.

[3]  Scott A. Mahlke,et al.  Runtime asynchronous fault tolerance via speculation , 2012, CGO '12.

[4]  Amin Ansari,et al.  Shoestring: probabilistic soft error reliability on the cheap , 2010, ASPLOS XV.

[5]  Kevin Skadron,et al.  A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads , 2010, IEEE International Symposium on Workload Characterization (IISWC'10).

[6]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[7]  Michael Sullivan,et al.  CRUM: Checkpoint-Restart Support for CUDA's Unified Memory , 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).

[8]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[9]  Kevin Skadron,et al.  Real-world design and evaluation of compiler-managed GPU redundant multithreading , 2014, 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA).

[10]  Edward J. McCluskey,et al.  Error detection by duplicated instructions in super-scalar processors , 2002, IEEE Trans. Reliab..

[11]  Huiyang Zhou,et al.  Understanding software approaches for GPGPU reliability , 2009, GPGPU-2.

[12]  Matsuoka Satoshi,et al.  MPI-CUDA Applications Checkpointing , 2010 .

[13]  Daniel J. Sorin,et al.  Argus-G: Comprehensive, Low-Cost Error Detection for GPGPU Cores , 2015, IEEE Computer Architecture Letters.

[14]  Mattan Erez,et al.  Hamartia: A Fast and Accurate Error Injection Framework , 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W).

[15]  Joel S. Emer,et al.  The soft error problem: an architectural perspective , 2005, 11th International Symposium on High-Performance Computer Architecture.

[16]  J. Xu OpenCL – The Open Standard for Parallel Programming of Heterogeneous Systems , 2009 .

[17]  Xin Fu,et al.  RISE: Improving the streaming processors reliability against soft errors in GPGPUs , 2012, 2012 21st International Conference on Parallel Architectures and Compilation Techniques (PACT).

[18]  David I. August,et al.  SWIFT: software implemented fault tolerance , 2005, International Symposium on Code Generation and Optimization.

[19]  Hyeran Jeon,et al.  Warped-DMR: Light-weight Error Detection for GPGPU , 2012, 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture.

[20]  Yi Yang,et al.  Revisiting ILP Designs for Throughput-Oriented GPGPU Architecture , 2015, 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing.

[21]  Philip Koopman,et al.  Selection of Cyclic Redundancy Code and Checksum Algorithms to Ensure Critical Data Integrity , 2015 .

[22]  Lisa Spainhower,et al.  Commercial fault tolerance: a tale of two systems , 2004, IEEE Transactions on Dependable and Secure Computing.

[23]  Fernanda Gusmão de Lima Kastensmidt,et al.  Implementation and experimental evaluation of a CUDA core under single event effects , 2014, 2014 15th Latin American Test Workshop - LATW.

[24]  Edward J. McCluskey,et al.  Control-flow checking by software signatures , 2002, IEEE Trans. Reliab..

[25]  Meeta Sharma Gupta,et al.  Error Tolerance in Server Class Processors , 2011, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems.

[26]  David W. Nellans,et al.  Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[27]  Tipp Moseley,et al.  Using Process-Level Redundancy to Exploit Multiple Cores for Transient Fault Tolerance , 2007, 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07).

[28]  Luigi Carro,et al.  GPGPUs ECC efficiency and efficacy , 2014, 2014 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT).

[29]  Hyeran Jeon,et al.  Warped-RE: Low-Cost Error Detection and Correction in GPUs , 2015, 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.

[30]  David I. August,et al.  Software-controlled fault tolerance , 2005, TACO.

[31]  M. Rimen,et al.  Implicit signature checking , 1995, Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers.

[32]  Cheng Wang,et al.  Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection , 2007, International Symposium on Code Generation and Optimization (CGO'07).

[33]  Josep Torrellas,et al.  InstantCheck: Checking the Determinism of Parallel Programs Using On-the-Fly Incremental Hashing , 2010, 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.

[34]  Rajesh K. Gupta,et al.  Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading , 2017, 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC).

[35]  Shubhendu S. Mukherjee,et al.  Transient fault detection via simultaneous multithreading , 2000, Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201).

[36]  Shidhartha Das,et al.  A Triple Core Lock-Step (TCLS) ARM® Cortex®-R5 Processor for Safety-Critical and Ultra-Reliable Applications , 2016, 2016 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshop (DSN-W).

[37]  Hiroaki Kobayashi,et al.  CheCUDA: A Checkpoint/Restart Tool for CUDA Applications , 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.

[38]  Keshav Pingali,et al.  A quantitative study of irregular programs on GPUs , 2012, 2012 IEEE International Symposium on Workload Characterization (IISWC).

[39]  Stephen W. Keckler,et al.  SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[40]  Tipp Moseley,et al.  PLR: A Software Approach to Transient Fault Tolerance for Multicore Architectures , 2009, IEEE Transactions on Dependable and Secure Computing.