SwapCodes: Error Codes for Hardware-Software Cooperative GPU Pipeline Error Detection

Intra-thread instruction duplication offers straightforward and effective pipeline error detection for data-intensive processors. However, software-enforced instruction duplication requires explicit checking instructions, roughly doubles program register usage, and doubles the arithmetic operation count per thread, potentially leading to severe slowdowns. This paper investigates SwapCodes, a family of software-hardware cooperative mechanisms that accelerate intra-thread duplication in GPUs. SwapCodes leverages the register file ECC hardware to detect pipeline errors without sacrificing the ability of ECC to detect and correct storage errors. By implicitly checking for pipeline errors on each register read, SwapCodes avoids the overhead of explicit checking instructions without adding new hardware error checkers or buffers. We describe a family of SwapCodes implementations that successively eliminate the sources of inefficiency in intra-thread duplication, with different complexities and different trade-offs between error detection and correction. We apply SwapCodes to protect a GPU-based processor against pipeline errors and demonstrate that it detects more than 99.3% of pipeline errors while improving performance and system efficiency relative to software-enforced duplication. The most performant SwapCodes organization incurs just 15% average slowdown over the un-duplicated program.
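To make the contrast concrete, the following is a minimal Python sketch, not the paper's implementation. The first part models software-enforced duplication as the abstract describes it: the operation runs twice, the duplicate occupies a second register, and an explicit comparison checks the results. The second part illustrates one way checking could become implicit on a register read. It assumes, for illustration only, that the original instruction supplies the data and the duplicate supplies the check bits of the same register. All names (`fma`, `fault_mask`, `CheckedRegister`) are hypothetical, and a single parity bit stands in for real register file ECC.

```python
class PipelineError(Exception):
    """Raised when a pipeline error is detected."""


def fma(a, b, c, fault_mask=0):
    """Toy 32-bit arithmetic unit: a*b + c, with optional injected bit flips."""
    return ((a * b + c) ^ fault_mask) & 0xFFFFFFFF


def duplicated_fma(a, b, c, fault_mask=0):
    """Software-enforced duplication: compute twice, then check explicitly."""
    r = fma(a, b, c, fault_mask)  # original instruction (possibly faulty)
    r_dup = fma(a, b, c)          # duplicate instruction, needs a second register
    if r != r_dup:                # explicit checking instruction
        raise PipelineError(f"mismatch: {r:#x} != {r_dup:#x}")
    return r


def parity(x):
    """Single parity bit; a deliberately weak stand-in for ECC check bits."""
    return bin(x).count("1") & 1


class CheckedRegister:
    """Toy register (assumed scheme): data from the original instruction,
    check bits from the duplicate, verified implicitly on every read."""

    def write_data(self, value):
        self.data = value

    def write_check(self, value):
        self.check = parity(value)

    def read(self):
        if parity(self.data) != self.check:
            raise PipelineError("check bits do not match data")
        return self.data


if __name__ == "__main__":
    print(duplicated_fma(3, 4, 5))      # fault-free: both copies agree
    reg = CheckedRegister()
    reg.write_data(fma(3, 4, 5, fault_mask=0b100))  # faulty original
    reg.write_check(fma(3, 4, 5))                   # correct duplicate
    try:
        reg.read()
    except PipelineError as err:
        print("detected on read:", err)
```

In the explicit version, every protected operation costs an extra register and a compare. In the register-based sketch, the comparison is folded into the read path, which is the kind of saving the abstract attributes to implicit checking. A single parity bit misses any even number of flipped bits, so this toy is far weaker than actual ECC.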
