A Checkpoint/Restart Scheme for CUDA Programs with Complex Computation States