GPUDirect Async: Exploring GPU synchronous communication techniques for InfiniBand clusters

Abstract NVIDIA GPUDirect is a family of technologies aimed at optimizing data movement among GPUs (P2P) or among GPUs and third-party devices (RDMA). GPUDirect Async, introduced in CUDA 8.0, is a new addition which allows direct synchronization between GPU and third party devices. For example, Async allows an NVIDIA GPU to directly trigger and poll for completion of communication operations queued to an InfiniBand Connect-IB network adapter, with no involvement of CPU in the critical communication path of GPU applications. In this paper we describe the motivations and the building blocks of GPUDirect Async. After an initial analysis with a micro-benchmark, by means of a performance model, we show the potential benefits of using two different asynchronous communication models supported by this new technology in two MPI multi-GPU applications: HPGMG-FV, a proxy for real-world geometric multi-grid applications and CoMD-CUDA, a proxy for Classical Molecular Dynamics codes. We also report a test case in which the use of GPUDirect Async does not provide any advantage, that is an implementation of the Breadth First Search algorithm for large scale graphs.