Paper information - Towards constructing application-level GPU computation states

Towards constructing application-level GPU computation states

Computation state construction, which saves and restores state during program execution, is an indispensable step in achieving fault tolerance and computation mobility for scientific applications. However, even as the GPU takes on a larger role in high-performance computing, no effective state construction scheme exists for it yet, because the GPU executes work in batch mode. The GPU's complex memory hierarchy scatters state across different memory locations that are difficult to fetch, and parallel execution makes it difficult to construct the state of each thread. This paper proposes an application-level computation state construction scheme to support GPU programs. A precompiler and a run-time support module are developed to construct states dynamically and save them in CPU system memory. Memory blocks are registered, and new data structures are proposed to save and restore the computation states represented by variables and pointers on the GPU. Secondary storage can be used for scalability and long-term fault tolerance.
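The abstract does not spell out the paper's data structures, so the following is only a minimal sketch of the general idea of registering memory blocks and saving pointers in a relocatable form. All names here (`Block`, `PortablePtr`, `BlockRegistry`, `demo_roundtrip`) are hypothetical, and ordinary host memory stands in for GPU memory; in a real CUDA program the block copies would presumably go through `cudaMemcpy` between device and host.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical record for one registered memory block.
struct Block {
    void*       base;  // current address of the block
    std::size_t size;  // size in bytes
};

// A pointer stored in relocatable form: which block, and how far into it.
struct PortablePtr {
    int         block;   // index into the registry, -1 if not found
    std::size_t offset;  // byte offset from the block base
};

class BlockRegistry {
public:
    // Register a block and return its id.
    int add(void* base, std::size_t size) {
        blocks_.push_back({base, size});
        return static_cast<int>(blocks_.size()) - 1;
    }

    // Translate a raw address into (block id, offset).
    PortablePtr encode(const void* p) const {
        auto addr = reinterpret_cast<std::uintptr_t>(p);
        for (std::size_t i = 0; i < blocks_.size(); ++i) {
            auto lo = reinterpret_cast<std::uintptr_t>(blocks_[i].base);
            if (addr >= lo && addr < lo + blocks_[i].size)
                return {static_cast<int>(i), static_cast<std::size_t>(addr - lo)};
        }
        return {-1, 0};
    }

    // Translate (block id, offset) back into an address at the current base.
    void* decode(PortablePtr pp) const {
        if (pp.block < 0) return nullptr;
        return static_cast<char*>(blocks_[pp.block].base) + pp.offset;
    }

    // Copy the contents of every registered block into a snapshot.
    std::vector<std::vector<unsigned char>> save() const {
        std::vector<std::vector<unsigned char>> snap;
        for (const Block& b : blocks_) {
            const auto* src = static_cast<const unsigned char*>(b.base);
            snap.emplace_back(src, src + b.size);
        }
        return snap;
    }

    // Restore one block's bytes into a new location and update its base.
    void rebind(int id, void* new_base, const std::vector<unsigned char>& bytes) {
        std::memcpy(new_base, bytes.data(), bytes.size());
        blocks_[id].base = new_base;
    }

private:
    std::vector<Block> blocks_;
};

// Save a block and a pointer into it, restore both at a different address.
bool demo_roundtrip() {
    BlockRegistry reg;
    std::vector<int> data = {10, 20, 30, 40};
    int id = reg.add(data.data(), data.size() * sizeof(int));

    int* cursor = &data[2];                      // pointer that is part of the state
    PortablePtr saved_ptr = reg.encode(cursor);  // stored as (block, offset)
    auto snap = reg.save();

    std::vector<int> fresh(4, 0);                // stands in for memory after restart
    reg.rebind(id, fresh.data(), snap[id]);
    int* restored = static_cast<int*>(reg.decode(saved_ptr));
    return restored == &fresh[2] && *restored == 30;
}
```

Storing pointers as a block id plus an offset, rather than as raw addresses, is one common way to keep them valid when the restored blocks land at different addresses; whether the paper uses this exact representation is an assumption.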

Hai Jiang | Kuan-Ching Li | Yulu Zhang | Xinyuan Guo
