On the Trend of Resilience for GPU-Dense Systems
Mattan Erez | Stephen W. Keckler | Michael B. Sullivan | Timothy Tsai | Siva Kumar Sastry Hari | Kyushick Lee
[1] Gabriel H. Loh, et al. Leveraging near data processing for high-performance checkpoint/restart, 2017, SC.
[2] Bronis R. de Supinski, et al. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Ravishankar K. Iyer, et al. Lessons Learned from the Analysis of System Failures at Petascale: The Case of Blue Waters, 2014, 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks.
[4] Yong Chen, et al. Towards scalable I/O architecture for exascale systems, 2011, MTAGS '11.
[5] Luigi Carro, et al. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).
[6] Hiroaki Kobayashi, et al. CheCUDA: A Checkpoint/Restart Tool for CUDA Applications, 2009, 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies.
[7] Franck Cappello, et al. Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications, 2014, 2014 IEEE 28th International Parallel and Distributed Processing Symposium.
[8] Bronis R. de Supinski, et al. The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems, 2018, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis.
[9] Robert B. Ross, et al. On the role of burst buffers in leadership-class storage systems, 2012, 2012 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST).
[10] Michael Sullivan, et al. CRUM: Checkpoint-Restart Support for CUDA's Unified Memory, 2018, 2018 IEEE International Conference on Cluster Computing (CLUSTER).
[11] Satoshi Matsuoka, et al. NVCR: A Transparent Checkpoint-Restart Library for NVIDIA CUDA, 2011, 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum.