Owing to its high computing capacity and energy efficiency, the GPU (graphics processing unit) has become a focus of research in HPC (high performance computing), and CPU-GPU heterogeneous parallel systems have become a new development trend for supercomputers. However, the inherent unreliability of GPU hardware degrades the reliability of such supercomputers. We have studied fault-tolerance (FT) techniques for CPU-GPU heterogeneous parallel systems and introduce a new checkpointing mechanism for them: hierarchical application-level checkpointing. The basic idea is to checkpoint at two independent levels, the CPU level and the GPU level, to tolerate CPU faults and GPU faults respectively. Based on this idea, we have designed and implemented a hierarchical application-level checkpointing tool, "HiAL-Ckpt". With this tool, programmers insert two kinds of directives, CPU directives and GPU directives, into a program, and the compiler transforms each directive into CPU or GPU checkpointing code according to its kind. A case study of SWIM, a benchmark program from the SPEC2000 suite, demonstrates the validity of the hierarchical application-level checkpointing technique. The experimental results show that the fault-tolerance time overhead of HiAL-Ckpt for SWIM is only 2.25% relative to the execution time of SWIM without any FT work.
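The two-level idea can be illustrated with a minimal sketch. This is not HiAL-Ckpt's directive syntax or its generated code, which the abstract does not show. All names here (`cpu_checkpoint`, `run_kernel_with_gpu_checkpoint`, `GpuFault`, `make_flaky_kernel`) are hypothetical, and the GPU is simulated by an ordinary Python function. The sketch assumes a CPU-level checkpoint writes host state to stable storage, while a GPU-level checkpoint keeps a host-side copy of the kernel inputs so that a faulty kernel can be re-launched without rolling back the whole program.

```python
import copy
import os
import pickle
import tempfile


class GpuFault(Exception):
    """Stands in for a detected GPU error (hypothetical)."""


def cpu_checkpoint(path, state):
    # CPU level: persist host state to stable storage.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # rename last so a crash never leaves a torn file


def cpu_restore(path):
    # CPU level: reload the most recent host checkpoint.
    with open(path, "rb") as f:
        return pickle.load(f)


def run_kernel_with_gpu_checkpoint(kernel, data, max_retries=3):
    # GPU level: keep a host-side copy of the kernel inputs and
    # re-launch from that copy if the kernel reports a fault.
    saved = copy.deepcopy(data)
    for _ in range(max_retries + 1):
        try:
            return kernel(copy.deepcopy(saved))
        except GpuFault:
            continue
    raise RuntimeError("GPU fault persisted after retries")


def make_flaky_kernel(failures):
    # Simulated kernel: fails `failures` times, then doubles each element.
    remaining = [failures]

    def kernel(values):
        if remaining[0] > 0:
            remaining[0] -= 1
            values[0] = -1  # corrupt the working copy before failing
            raise GpuFault()
        return [2 * v for v in values]

    return kernel


def demo():
    path = os.path.join(tempfile.mkdtemp(), "host.ckpt")
    state = {"step": 0, "grid": [1, 2, 3]}
    cpu_checkpoint(path, state)  # CPU-level checkpoint before the kernel
    state["grid"] = run_kernel_with_gpu_checkpoint(
        make_flaky_kernel(2), state["grid"]
    )
    state["step"] = 1
    restored = cpu_restore(path)  # what a CPU-level rollback would recover
    return state, restored
```

In this sketch the two simulated GPU faults are absorbed at the GPU level by re-launching from the saved inputs, so the CPU-level checkpoint is never needed for recovery; it remains available for host-side failures.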