Analysis of NVMe-SSD to passthrough GPU data transfer in virtualized systems

Non-volatile memory (NVM) storage technologies provide faster data access than traditional hard disk drives and can benefit applications executing on accelerators such as general-purpose graphics processing units (GPGPUs). Many contemporary GPU-friendly applications process huge volumes of data residing in secondary storage. Several research works propose techniques to optimize data transfer overheads between devices connected to the same bus, e.g., peer-to-peer data transfer between an NVMe-SSD and a GPU attached to the same PCIe bus. The applicability of these techniques, the extent of their benefits, and their associated costs in virtualized systems are the scope of this paper. We present a comprehensive empirical analysis of different combinations of NVMe-SSD virtualization techniques and data transfer mechanisms between NVMe-SSDs and GPUs. Further, we analyze the impact of different data transfer parameters and perform a root-cause analysis of the resulting performance, in terms of data transfer throughput and CPU utilization, for each combination of techniques. Based on this empirical analysis, we provide insights to address several bottlenecks in GPU data transfer across virtualization setups and motivate an alternate design that extends the VirtIO framework for efficient peer-to-peer data transfer.
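
For context, the baseline (non-peer-to-peer) data path that such techniques aim to optimize stages data in host memory: one DMA moves data from the NVMe-SSD into a CPU-side bounce buffer, and a second DMA copies it into GPU memory. Below is a minimal sketch of this bounce-buffer path using POSIX I/O and the CUDA runtime; the device path /dev/nvme0n1 and the 1 MiB chunk size are illustrative assumptions, not values from the paper.

```c
/* Baseline NVMe-SSD -> GPU transfer through a host bounce buffer.
 * Peer-to-peer schemes (e.g., GPUDirect-style transfers) aim to
 * remove this intermediate trip through host memory. */
#define _GNU_SOURCE            /* for O_DIRECT on glibc */
#include <cuda_runtime.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void) {
    const size_t chunk = 1 << 20;                        /* 1 MiB unit (illustrative) */
    int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);  /* hypothetical NVMe namespace */
    if (fd < 0) { perror("open"); return 1; }

    void *host_buf;                     /* O_DIRECT requires aligned buffers */
    if (posix_memalign(&host_buf, 4096, chunk) != 0) return 1;

    void *dev_buf;
    if (cudaMalloc(&dev_buf, chunk) != cudaSuccess) return 1;

    /* DMA #1: SSD -> host bounce buffer. */
    ssize_t n = read(fd, host_buf, chunk);
    if (n < 0) { perror("read"); return 1; }

    /* DMA #2: host bounce buffer -> GPU memory. Pinning the host
     * buffer (cudaHostRegister) would typically speed up this copy. */
    cudaMemcpy(dev_buf, host_buf, (size_t)n, cudaMemcpyHostToDevice);

    cudaFree(dev_buf);
    free(host_buf);
    close(fd);
    return 0;
}
```

In a virtualized guest, each of these two DMA steps may additionally cross a virtualization boundary (e.g., a virtio-blk/NVMe emulation layer or an IOMMU-mediated passthrough mapping), which is why the combination of NVMe-SSD virtualization technique and data transfer mechanism matters for the throughput and CPU utilization studied in this paper.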
