A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States

Checkpoint/restart has been an effective mechanism for achieving fault tolerance in many long-running scientific applications. The common approach is to save computation states in memory and secondary storage so that execution can be resumed later. However, although the GPU plays an increasingly large role in high-performance computing, no effective checkpoint/restart scheme exists for it yet, because GPU computation states are difficult to handle. This paper proposes an application-level checkpoint/restart scheme that saves and restores GPU computation states in annotated user programs. A pre-compiler and a run-time support module are developed to construct and save states in CPU system memory dynamically, while secondary storage can be used for scalability and long-term fault tolerance. CUDA programs with complicated computation states are supported: state-related variables scattered across various memory units are collected, and both the stack and the heap are duplicated at the application level for state construction. Experimental results demonstrate the effectiveness of the proposed scheme.
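
To make the idea of collecting scattered state-related variables into CPU system memory concrete, the following is a minimal host-side sketch in C++. It is not the paper's pre-compiler output or run-time module: the names `StateRegion`, `Checkpoint`, `track`, `save`, and `restore` are hypothetical and introduced only for illustration. The sketch also assumes that any GPU-resident data has already been staged into host buffers (for example, with a device-to-host copy) before `save` is called.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical descriptor for one state-related variable (not the paper's API).
struct StateRegion {
    void*       addr;   // live location of the variable in host memory
    std::size_t bytes;  // size of the variable in bytes
};

// Hypothetical application-level checkpoint kept in CPU system memory.
class Checkpoint {
public:
    // Register a variable that belongs to the computation state.
    void track(void* addr, std::size_t bytes) {
        regions_.push_back({addr, bytes});
    }

    // Copy every tracked region into one contiguous in-memory image.
    void save() {
        image_.clear();
        for (const StateRegion& r : regions_) {
            const unsigned char* p = static_cast<const unsigned char*>(r.addr);
            image_.insert(image_.end(), p, p + r.bytes);
        }
    }

    // Copy the saved image back over the tracked regions, in the same order.
    // Returns false if the image is too small (e.g. save() was never called).
    bool restore() const {
        std::size_t offset = 0;
        for (const StateRegion& r : regions_) {
            if (offset + r.bytes > image_.size()) return false;
            std::memcpy(r.addr, image_.data() + offset, r.bytes);
            offset += r.bytes;
        }
        return true;
    }

    // Total size of the saved image in bytes.
    std::size_t image_size() const { return image_.size(); }

private:
    std::vector<StateRegion>   regions_;  // tracked variables
    std::vector<unsigned char> image_;    // saved state in CPU memory
};
```

In this sketch, a program would call `track` once per state variable, call `save` at a checkpoint location, and call `restore` on restart. Writing the saved image to secondary storage would be a separate step and is not shown here.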
