Coarse grain computation-communication overlap for efficient application-level checkpointing for GPUs
Arun K. Somani | Lizandro D. Solano-Quinde | Brett M. Bode
[1] Peter K. Szwed,et al. Application-level checkpointing for shared memory programs , 2004, ASPLOS XI.
[2] John Owens,et al. Streaming architectures and technology trends , 2005, SIGGRAPH Courses.
[3] Vijay S. Pande,et al. Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU , 2009, 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing.
[4] Kai Li,et al. Libckpt: Transparent Checkpointing under UNIX , 1995, USENIX.
[5] Jacob A. Abraham,et al. Algorithm-Based Fault Tolerance for Matrix Operations , 1984, IEEE Transactions on Computers.
[6] Arun K. Somani,et al. Achieving Robustness and Minimizing Overhead in Parallel Algorithms Through Overlapped Communication/Computation , 2000, The Journal of Supercomputing.
[7] Song Jiang,et al. Current practice and a direction forward in checkpoint/restart implementations for fault tolerance , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[8] Hiroaki Kobayashi,et al. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications , 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[9] Mark J. Harris,et al. Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware , 2007, Graphics Hardware.
[10] Pat Hanrahan,et al. Understanding the efficiency of GPU algorithms for matrix-matrix multiplication , 2004, Graphics Hardware.
[11] Hai Jiang,et al. Compile/run-time support for thread migration , 2002, Proceedings 16th International Parallel and Distributed Processing Symposium.
[12] James Demmel,et al. Benchmarking GPUs to tune dense linear algebra , 2008, HiPC 2008.
[13] Song Jiang,et al. Transparent, Incremental Checkpointing at Kernel Level: a Foundation for Fault Tolerance for Parallel Computers , 2005, ACM/IEEE SC 2005 Conference (SC'05).
[14] Miron Livny,et al. Checkpoint and Migration of UNIX Processes in the Condor Distributed Processing System , 1997.
[15] Eric Roman. A Survey of Checkpoint/Restart Implementations , 2002.
[16] John Paul Walters,et al. Application-Level Checkpointing Techniques for Parallel Programs , 2006, ICDCIT.
[17] Jens H. Krüger,et al. A Survey of General‐Purpose Computation on Graphics Hardware , 2007, Eurographics.
[18] Rida A. Bazzi,et al. Compiler-assisted heterogeneous checkpointing , 2001, Proceedings 20th IEEE Symposium on Reliable Distributed Systems.