GPU-Job Migration: The rCUDA Case

Virtualization techniques have been shown to benefit data centers and other computing facilities. Not only do virtual machines make it possible to reduce the size of the computing infrastructure while increasing overall resource utilization, but virtualizing individual components of a computer can also provide significant benefits. This is the case, for instance, for remote GPU virtualization, a technique implemented in several frameworks in recent years. The flexibility that remote GPU virtualization provides can be increased further by adding a migration mechanism, so that the GPU part of an application can be live-migrated, transparently and during execution, to another GPU elsewhere in the cluster. In this paper we present the implementation of the migration mechanism within the rCUDA remote GPU virtualization middleware, together with a thorough performance analysis of that implementation. To that end, we use both synthetic and real production applications on three different generations of NVIDIA GPUs and two different versions of the InfiniBand interconnect. Several use cases illustrate the benefits that the GPU-job migration mechanism can bring to data centers.
