Offloading Communication Control Logic in GPU Accelerated Applications

NVIDIA GPUDirect is a family of technologies aimed at optimizing data movement among GPUs (P2P) or between GPUs and third-party devices (RDMA). GPUDirect Async, introduced in CUDA 8.0, is a new addition that allows direct synchronization between a GPU and third-party devices. For example, Async allows an NVIDIA GPU to directly trigger and poll for completion of communication operations queued to an InfiniBand Connect-IB network adapter, removing CPU involvement from the critical path in GPU-accelerated applications. In this paper, we present the building blocks of GPUDirect Async and explain the supported usage models of this new technology. We also present a performance evaluation using a micro-benchmark and a synthetic stencil benchmark. Finally, we demonstrate the use of Async in a few multi-GPU MPI applications: HPGMG-FV (geometric multi-grid), achieving up to 25% improvement in total execution time; CoMD-CUDA (classical molecular dynamics), reducing communication times by up to 30%; and LULESH2-CUDA, achieving an average performance improvement of 13% in total execution time.