Long-Term Reliability Management For Multitasking GPGPUs

This paper proposes long-term reliability management for spatial multitasking GPU architectures. Specifically, we focus on electromigration (EM)-induced long-term failure of the GPU's power delivery network. A distributed power delivery network model at functional unit granularity is developed and used for our EM analysis of GPU architectures. We use a recently proposed physics-based EM reliability model and consider the EM-induced time-to-failure at the GPU system level as a reliability resource. For GPU scheduling, we mainly focus on spatial multitasking, which allows GPU computing resources to be partitioned among multiple applications. We find that the existing reliability-agnostic thread block scheduler for spatial multitasking is effective in achieving high GPU utilization, but poor reliability. We develop and implement a long-term reliability-aware thread block scheduler in GPGPU-Sim, and compare it against existing reliability-agnostic scheduler. We evaluate several use cases of spatial multitasking and find that our proposed scheduler achieves up to 30% improvement in long-term reliability.

[1]  Jörg Henkel,et al.  Recent advances in EM and BTI induced reliability modeling, analysis and optimization (invited) , 2018, Integr..

[2]  Saurabh Gupta,et al.  Reliability lessons learned from GPU experience with the Titan supercomputer at Oak Ridge leadership computing facility , 2015, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis.

[3]  Nam Sung Kim,et al.  The case for GPGPU spatial multitasking , 2012, IEEE International Symposium on High-Performance Comp Architecture.

[4]  Sheldon X.-D. Tan,et al.  EM-based on-chip aging sensor for detection and prevention of counterfeit and recycled ICs , 2015, 2015 IEEE/ACM International Conference on Computer-Aided Design (ICCAD).

[5]  Rami G. Melhem,et al.  Simultaneous Multikernel GPU: Multi-tasking throughput processors via fine-grained sharing , 2016, 2016 IEEE International Symposium on High Performance Computer Architecture (HPCA).

[6]  Kevin Skadron,et al.  Interconnect lifetime prediction under dynamic stress for reliability-aware design , 2004, IEEE/ACM International Conference on Computer Aided Design, 2004. ICCAD-2004..