Analyzing the criticality of transient faults-induced SDCS on GPU applications

In this paper we compare the soft-error sensitivity of parallel applications on modern Graphics Processing Units (GPUs) obtained through architectural-level fault injections and high-energy particle beam radiation experiments. Fault-injection and beam experiments provide different information and uses different transient-fault sensitivity metrics, which are hard to combine. In this paper we show how correlating beam and fault-injection data can provide a deeper understanding of the behavior of GPUs in the occurrence of transient faults. In particular, we demonstrate that commonly used architecture-level fault models (and fast injection tools) can be used to identify critical kernels and to associate some experimentally observed output errors with their causes. Additionally, we show how register file and instruction-level injections can be used to evaluate ECC efficiency in reducing the radiation-induced error rate.

[1]  Zhongliang Chen,et al.  NUPAR: A Benchmark Suite for Modern GPU Architectures , 2015, ICPE.

[2]  Hans Werner Meuer,et al.  Top500 Supercomputer Sites , 1997 .

[3]  Luigi Carro,et al.  Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators , 2017, 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[4]  Ian Karlin,et al.  LULESH 2.0 Updates and Changes , 2013 .

[5]  Bronis R. de Supinski,et al.  Experiences with Achieving Portability across Heterogeneous Architectures , 2011 .

[6]  Bo Fang,et al.  GPU-Qin: A methodology for evaluating the error resilience of GPGPU applications , 2014, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[7]  Shohaib Aboobacker RAZOR: circuit-level correction of timing errors for low-power operation , 2011 .

[8]  B. L. Bhuva,et al.  Comparison of Combinational and Sequential Error Rates for a Deep Submicron Process , 2011, IEEE Transactions on Nuclear Science.

[9]  Stephen W. Keckler,et al.  SASSIFI: An architecture-level fault injection tool for GPU application resilience evaluation , 2017, 2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[10]  Tom R. Halfhill NVIDIA's Next-Generation CUDA Compute and Graphics Architecture, Code-Named Fermi, Adds Muscle for Parallel Processing , 2009 .

[11]  M. Lopez-Vallejo,et al.  System Design Framework and Methodology for Xilinx Virtex FPGA Configuration Scrubbers , 2014, IEEE Transactions on Nuclear Science.

[12]  Todd M. Austin,et al.  A Systematic Methodology to Compute the Architectural Vulnerability Factors for a High-Performance Microprocessor , 2003, MICRO.

[13]  Luigi Carro,et al.  Understanding GPU errors on large-scale HPC systems and the implications for system design and operation , 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).

[14]  Michiel van Ratingen,et al.  The European New Car Assessment Programme , 2014 .

[15]  Ravishankar K. Iyer,et al.  An experimental study of soft errors in microprocessors , 2005, IEEE Micro.

[16]  Mauricio Hanzich,et al.  Mimetic seismic wave modeling including topography on deformed staggered grids , 2014 .

[17]  Xiaomei Yang Rounding Errors in Algebraic Processes , 1964, Nature.

[18]  Claus Braun,et al.  Efficacy and efficiency of algorithm-based fault-tolerance on GPUs , 2013, 2013 IEEE 19th International On-Line Testing Symposium (IOLTS).

[19]  David R. Kaeli,et al.  The Effect of Input Data on Program Vulnerability , 2009 .

[20]  Melvin A. Breuer,et al.  Defect and error tolerance in the presence of massive numbers of defects , 2004, IEEE Design & Test of Computers.

[21]  Dimitris Gizopoulos,et al.  GUFI: A framework for GPUs reliability assessment , 2016, 2016 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS).

[22]  Franck Cappello,et al.  Addressing failures in exascale computing , 2014, Int. J. High Perform. Comput. Appl..

[23]  Laura Monroe,et al.  GPU Behavior on a Large HPC Cluster , 2013, Euro-Par Workshops.

[24]  Melvin A. Breuer,et al.  Multi-media applications and imprecise computation , 2005, 8th Euromicro Conference on Digital System Design (DSD'05).

[25]  R.C. Baumann,et al.  Radiation-induced soft errors in advanced semiconductor technologies , 2005, IEEE Transactions on Device and Materials Reliability.

[26]  David W. Nellans,et al.  Flexible software profiling of GPU architectures , 2015, 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA).

[27]  John Shalf,et al.  DOE Advanced Scientific Computing Advisory Subcommittee (ASCAC) Report: Top Ten Exascale Research Challenges , 2014 .

[28]  Thiago Santini,et al.  Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units , 2016, IEEE Transactions on Computers.

[29]  Luigi Carro,et al.  Evaluation of Histogram of Oriented Gradients Soft Errors Criticality for Automotive Applications , 2016, ACM Trans. Archit. Code Optim..

[30]  M. Baze,et al.  Comparison of error rates in combinational and sequential logic , 1997 .

[31]  Kevin Skadron,et al.  Rodinia: A benchmark suite for heterogeneous computing , 2009, 2009 IEEE International Symposium on Workload Characterization (IISWC).

[32]  William M. Jones,et al.  Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App , 2015, 2015 IEEE International Conference on Cluster Computing.

[33]  Luigi Carro,et al.  GPGPUs: How to combine high computational power with high reliability , 2014, 2014 Design, Automation & Test in Europe Conference & Exhibition (DATE).

[34]  Ganesh Gopalakrishnan,et al.  Determinism and Reproducibility in Large-Scale HPC Systems , 2013 .