Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs

Graphics Processing Units (GPUs) are increasingly used to solve non-graphical scientific problems. However, GPU reliability has been shown to be a concern because of the occurrence of soft and hard errors. Checkpoint/restart is the most commonly used technique for achieving fault tolerance in the presence of failures. This work presents an application-level checkpointing scheme for systems composed of GPUs. Our scheme exploits divide-and-conquer and communication-computation overlap to reduce both execution time and checkpoint overhead. By dividing the problem and its checkpointing into n subprocesses, we show that our scheme reduces the checkpoint overhead by a factor of n. We also show that dividing the problem at a finer granularity is not beneficial.
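The factor-of-n claim can be illustrated with a toy timing model. The sketch below is not the scheme's implementation and uses no GPU API. It assumes that the computation and the checkpoint copy each split evenly into n chunks, that the copy of chunk i may run while chunk i+1 is being computed, and that there is no fixed per-transfer cost. The function names and the example timings are hypothetical.

```python
def pipelined_makespan(compute_time, checkpoint_time, n):
    """Total time for n chunks when each chunk's checkpoint copy
    overlaps with the computation of the next chunk (toy model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    t = compute_time / n      # assumed compute time per chunk
    c = checkpoint_time / n   # assumed checkpoint copy time per chunk
    compute_done = 0.0        # time at which the current chunk finishes computing
    copy_done = 0.0           # time at which the current chunk's copy finishes
    for _ in range(n):
        compute_done += t
        # A copy starts once its chunk is computed and the previous copy is done.
        copy_done = max(copy_done, compute_done) + c
    return copy_done


def checkpoint_overhead(compute_time, checkpoint_time, n):
    """Extra time over a run of the same computation with no checkpointing."""
    return pipelined_makespan(compute_time, checkpoint_time, n) - compute_time


if __name__ == "__main__":
    # Hypothetical timings: 100 time units of compute, 20 of checkpoint copy.
    for n in (1, 2, 4, 5):
        print(n, checkpoint_overhead(100.0, 20.0, n))
```

Under these assumptions, and as long as the copy of a chunk takes no longer than its computation, only the last chunk's copy remains exposed, so the overhead falls from the full checkpoint time to checkpoint_time / n. Because the model omits per-transfer costs, it says nothing about the limit on finer granularity noted above.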
