Paper information - Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs

Graphics Processing Units (GPUs) are increasingly used to solve non-graphical scientific problems. However, GPU reliability has been shown to be a concern because of soft and hard errors. Checkpoint/restart is the most commonly used technique for achieving fault tolerance in the presence of failures. This work presents an application-level checkpoint scheme for systems composed of GPUs. Our scheme exploits the divide-and-conquer technique and communication-computation overlap to improve both execution time and checkpoint overhead. By dividing the problem and its checkpointing into n subprocesses, we show that our scheme reduces the checkpoint overhead by a factor of n. We also show that dividing the problem at a finer granularity is not beneficial.
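The factor-of-n claim can be illustrated with a simple timing model. The sketch below is not the authors' implementation. It assumes a two-stage pipeline in which the checkpoint copy of one chunk overlaps with the computation of the next, that work and checkpoint data split evenly across n chunks, and that there is no fixed per-chunk cost. The function names and parameters are hypothetical.

```python
def pipelined_makespan(compute_time, checkpoint_time, n):
    """Total time when the problem is split into n equal chunks and the
    checkpoint copy of chunk i overlaps with the computation of chunk i+1.
    Assumed model: one compute engine, one copy engine, no per-chunk cost."""
    comp = compute_time / n      # compute time per chunk
    copy = checkpoint_time / n   # checkpoint transfer time per chunk
    compute_done = 0.0           # time at which the current chunk finishes computing
    copy_done = 0.0              # time at which the latest checkpoint copy finishes
    for _ in range(n):
        compute_done += comp
        # A copy starts once its chunk is computed and the previous copy is done.
        copy_done = max(compute_done, copy_done) + copy
    return copy_done


def checkpoint_overhead(compute_time, checkpoint_time, n):
    """Extra time added by checkpointing on top of pure computation."""
    return pipelined_makespan(compute_time, checkpoint_time, n) - compute_time


if __name__ == "__main__":
    # Illustrative numbers only: 8 time units of compute, 4 of checkpoint transfer.
    for n in (1, 2, 4):
        print(n, checkpoint_overhead(8.0, 4.0, n))
```

In this model only the last chunk's copy is exposed, so the overhead is the checkpoint time divided by n, as long as each chunk's copy takes no longer than its computation. When the copy is slower than the computation, the copies queue up and the reduction is smaller. The model does not capture the abstract's observation that finer granularity is not beneficial, since it ignores any per-chunk cost.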
