Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast

Broadcast is widely used in streaming and deep learning applications to disseminate large volumes of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize the hardware features available to Graphics Processing Unit (GPU)-based applications. In this paper, a model-oriented analysis is first presented to identify the performance bottlenecks of existing broadcast schemes on GPU clusters. Streaming-based broadcast schemes are then proposed that exploit InfiniBand hardware multicast (IB-MCAST) and NVIDIA GPUDirect RDMA technology for efficient message transmission. The proposed designs are evaluated using Message Passing Interface (MPI)-based benchmarks and applications. In the benchmark-level evaluation, the experimental results indicate improved scalability and up to 82 percent lower latency than state-of-the-art solutions. Furthermore, compared to the state of the art, the proposed design yields stable, higher throughput for a synthetic streaming workload, and 1.3x faster training for a deep learning framework.
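The GPUDirect-RDMA path described above can be exercised at the benchmark level with a CUDA-aware MPI library, where a GPU device pointer is passed to MPI_Bcast directly so the NIC can access device memory without host staging. The following is a minimal latency micro-benchmark sketch in that spirit, assuming a CUDA-aware MPI build such as MVAPICH2-GDR; the message size, iteration count, and binary name are illustrative and are not taken from the paper.

/*
 * Minimal GPU-buffer broadcast latency sketch.
 * Assumes a CUDA-aware MPI library (e.g., MVAPICH2-GDR) so that a
 * device pointer may be passed to MPI_Bcast directly; MSG_SIZE and
 * ITERS are illustrative values, not the paper's configuration.
 */
#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>

#define MSG_SIZE (4 * 1024 * 1024)  /* 4 MB message, illustrative */
#define ITERS    100                /* timed iterations */

int main(int argc, char **argv)
{
    int rank, size;
    void *d_buf = NULL;
    double start, elapsed;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Allocate the broadcast buffer in GPU memory; with GPUDirect
     * RDMA the NIC can read/write this buffer without staging the
     * data through host memory. */
    cudaMalloc(&d_buf, MSG_SIZE);

    /* Warm-up round so connection setup is excluded from timing. */
    MPI_Bcast(d_buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

    MPI_Barrier(MPI_COMM_WORLD);
    start = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Bcast(d_buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);
    MPI_Barrier(MPI_COMM_WORLD);
    elapsed = MPI_Wtime() - start;

    if (rank == 0)
        printf("avg broadcast latency: %f us (%d ranks)\n",
               elapsed / ITERS * 1e6, size);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}

A run such as "mpirun -np <N> ./bcast_bench" would report average per-broadcast latency. With MVAPICH2-GDR, the runtime parameter MV2_USE_CUDA=1 selects the CUDA-aware path; the additional parameters needed to enable the IB-MCAST-based designs are library- and version-specific, so consult the MPI library's documentation rather than this sketch for those flags.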
