Software-Based ECC for GPUs

Commodity off-the-shelf GPUs lack error-checking mechanisms for graphics memory, whereas conventional HPC platforms have used hardware-based ECC for DRAM. To address this reliability concern, we propose a software-based ECC scheme for GPGPU applications. We add small code fragments to ordinary CUDA programs that compute ECCs for data residing in graphics memory, so that transient bit-flips can be detected or masked. Preliminary performance studies with 3-D FFT and the N-body problem show that ECC-based error checking incurs overheads of 200% and 7%, respectively. We argue that these overheads stem from the cost of computing ECC on GPUs.
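To make the idea of computing ECCs in software concrete, the following is a minimal sketch of a single-error-correcting Hamming-style code over one 32-bit word. It is an illustration under stated assumptions, not the scheme evaluated here: the abstract does not specify the code, word size, or layout, and the names `position_of`, `ecc_encode`, `ecc_check`, and `EccStatus` are hypothetical. The functions are written as plain C++ so they can be run on the host; in a CUDA program they would presumably be marked `__device__` and called around loads and stores to graphics memory.

```cpp
#include <cstdint>

// Hypothetical result type for a checked read.
enum class EccStatus { Ok, Corrected, Uncorrectable };

constexpr int kDataBits = 32;  // assumption: one ECC value per 32-bit word

inline bool is_pow2(std::uint32_t x) { return x != 0 && (x & (x - 1)) == 0; }

// Map data bit i (0-based) to its 1-based position in a Hamming layout.
// Power-of-two positions are reserved for check bits, so data bits occupy
// positions 3, 5, 6, 7, 9, ... (position 38 for i = 31).
inline std::uint32_t position_of(int i) {
    std::uint32_t pos = 0;
    int seen = -1;
    while (seen < i) {
        ++pos;
        if (!is_pow2(pos)) ++seen;
    }
    return pos;
}

// Check value = XOR of the positions of all set data bits (6 bits suffice).
inline std::uint8_t ecc_encode(std::uint32_t word) {
    std::uint32_t check = 0;
    for (int i = 0; i < kDataBits; ++i) {
        if (word & (1u << i)) check ^= position_of(i);
    }
    return static_cast<std::uint8_t>(check);
}

// Recompute the check value and compare it with the stored one.
// A single flipped data bit is repaired in place; a single flipped
// check bit leaves the data intact. Multi-bit errors may be miscorrected,
// since this sketch has no extra parity bit for double-error detection.
inline EccStatus ecc_check(std::uint32_t& word, std::uint8_t stored) {
    std::uint32_t syndrome = ecc_encode(word) ^ stored;
    if (syndrome == 0) return EccStatus::Ok;
    if (is_pow2(syndrome)) return EccStatus::Corrected;  // error was in a check bit
    for (int i = 0; i < kDataBits; ++i) {
        if (position_of(i) == syndrome) {
            word ^= (1u << i);  // flip the faulty data bit back
            return EccStatus::Corrected;
        }
    }
    return EccStatus::Uncorrectable;  // syndrome matches no valid position
}
```

In this sketch, a writer would store `ecc_encode(word)` alongside each word and a reader would call `ecc_check` before using it. The per-bit loops are deliberately simple; they also hint at why the ECC computation itself can dominate the cost for bandwidth-bound kernels.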
