On the Trend of Resilience for GPU-Dense Systems

Emerging high-performance computing (HPC) systems tend toward heterogeneous nodes that are dense with accelerators such as GPUs, which deliver higher computational throughput at lower energy and cost than homogeneous CPU-only nodes. While an accelerator-rich machine reduces the number of compute nodes needed to reach a performance target, each node becomes more vulnerable: it is exposed to the failures of many accelerators and to the intra-node resources shared among them. Such failures must be recovered by end-to-end resilience schemes such as checkpoint-restart, yet preserving the large amount of local state held in accelerators incurs significant checkpointing overhead. This trend poses a new resilience challenge for accelerator-dense systems. We study its impact on multi-level checkpointing systems and on systems with burst buffers, quantifying system-level efficiency while sweeping the failure rate, system scale, and GPU density. Our multi-level checkpoint-restart model shows that efficiency begins to drop at a 16:1 GPU-to-CPU ratio in a 3.6 EFLOP system, and that a 64:1 ratio degrades overall system efficiency by 5%. Furthermore, we quantify the system-level impact of design options that can mitigate this resilience challenge in GPU-dense systems.
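
To make the efficiency trade-off concrete, the sketch below implements a first-order, Young/Daly-style checkpoint-restart efficiency estimate and sweeps GPU density. It is only an illustration of the kind of modeling the abstract describes: the parameter values (per-GPU MTBF, checkpoint cost per GPU, restart cost, node count) and the function names are assumptions for demonstration, not the model or the numbers used in the paper.

```python
"""Illustrative first-order checkpoint-restart efficiency model.

All parameter values are assumptions for demonstration; they are not the
configuration or results reported in the paper.
"""
import math


def optimal_interval(ckpt_cost, mtbf):
    """Young/Daly approximation of the optimal checkpoint interval (seconds)."""
    return math.sqrt(2.0 * ckpt_cost * mtbf)


def efficiency(ckpt_cost, restart_cost, mtbf):
    """Fraction of machine time spent on useful work.

    First-order model: each interval pays the checkpoint cost, and each
    failure (rate 1/mtbf) costs a restart plus, on average, half of the
    lost interval.
    """
    t = optimal_interval(ckpt_cost, mtbf)
    overhead = ckpt_cost / (t + ckpt_cost)            # time spent writing checkpoints
    waste_per_failure = restart_cost + 0.5 * (t + ckpt_cost)
    failure_waste = waste_per_failure / mtbf          # expected fraction lost to rework
    return max(0.0, (1.0 - overhead) * (1.0 - failure_waste))


# Assumed baseline: failure exposure and checkpoint volume both scale with
# the number of GPUs packed into each node.
NODE_COUNT = 1000
GPU_MTBF_HOURS = 50000.0      # assumed per-GPU mean time between failures
CKPT_COST_PER_GPU = 10.0      # assumed seconds of checkpoint I/O per GPU
RESTART_COST = 120.0          # assumed seconds to restart the job

for gpus_per_node in (1, 4, 16, 64):
    system_mtbf = (GPU_MTBF_HOURS * 3600.0) / (NODE_COUNT * gpus_per_node)
    ckpt_cost = CKPT_COST_PER_GPU * gpus_per_node
    eff = efficiency(ckpt_cost, RESTART_COST, system_mtbf)
    print(f"{gpus_per_node:>3} GPUs/node: system MTBF {system_mtbf / 60:7.1f} min, "
          f"efficiency {eff:5.1%}")
```

Even under these assumed parameters, the trend the abstract describes emerges: as GPUs per node increase, the system-wide failure rate and the per-checkpoint volume grow together, and the useful-work fraction falls off sharply at high GPU-to-CPU ratios.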
