InfiniBand-Verbs on GPU: A Case Study of Controlling an InfiniBand Network Device from the GPU

Due to their massive parallelism and high performance per watt, GPUs have become popular in high-performance computing and are a strong candidate for future exascale systems. However, communication and data transfer in GPU-accelerated systems remain a challenging problem. Since the GPU is normally unable to control a network device, a hybrid programming model is typically used today, in which the GPU performs the computation and the CPU handles the communication. As a result, communication between distributed GPUs suffers from unnecessary overhead introduced by switching control flow from the GPU to the CPU and vice versa. In this work, we modify the user-space libraries and device drivers of the GPU and the InfiniBand network device so that the GPU can control the InfiniBand device and independently source and sink communication requests without any involvement of the CPU. Our performance analysis details the differences to hybrid communication models; in particular, the CPU's advantage in generating work requests outweighs the overhead associated with context switching. In other words, our results show that complex networking protocols like the InfiniBand Verbs interface are better handled by CPUs despite the time penalties of context switching, since the overhead of work-request generation cannot be parallelized and does not fit the highly parallel programming model of GPUs.
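
To make the hybrid model referred to above concrete, the following sketch (not taken from the paper) shows the CPU-driven pattern that the GPU-controlled approach seeks to avoid: the GPU runs a kernel, control switches back to the host, and the CPU builds and posts an InfiniBand verbs work request. It assumes an already-connected queue pair qp, a completion queue cq, and a buffer dev_buf registered as memory region mr (e.g., a host staging buffer or a GPU buffer registered via GPUDirect RDMA); all setup, connection management, and error handling are omitted, and the names compute, post_and_wait, and hybrid_step are hypothetical.

#include <infiniband/verbs.h>
#include <cuda_runtime.h>
#include <cstdint>

__global__ void compute(float *buf, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        buf[i] *= 2.0f;                            /* placeholder computation */
}

/* Hypothetical helper: the CPU builds one RDMA-write work request for the
   result buffer, posts it, and busy-waits for its completion. */
static int post_and_wait(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                         void *local_buf, size_t len,
                         uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {};
    sge.addr   = (uintptr_t)local_buf;
    sge.length = (uint32_t)len;
    sge.lkey   = mr->lkey;

    struct ibv_send_wr wr = {}, *bad_wr = NULL;
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    if (ibv_post_send(qp, &wr, &bad_wr))           /* CPU generates the work request */
        return -1;

    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)           /* CPU polls the completion queue */
        ;
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}

/* One step of the hybrid model: GPU kernel, control switch to the CPU,
   CPU-driven InfiniBand communication. */
void hybrid_step(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                 float *dev_buf, int n, uint64_t remote_addr, uint32_t rkey)
{
    compute<<<(n + 255) / 256, 256>>>(dev_buf, n); /* GPU: computation           */
    cudaDeviceSynchronize();                       /* control returns to the CPU */
    post_and_wait(qp, cq, mr, dev_buf, (size_t)n * sizeof(float),
                  remote_addr, rkey);              /* CPU: communication         */
}

In the GPU-controlled variant studied in the paper, the work-request generation and posting performed here by the CPU would instead be issued from within the GPU itself; it is exactly this serial work-request generation that the results identify as poorly suited to the GPU's parallel programming model.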
