Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast
Dhabaleswar K. Panda | Xiaoyi Lu | Hari Subramoni | Ching-Hsiang Chu | Bracy Elton | Ammar A. Awan
[1] Zizhong Chen et al. Performance of MPI broadcast algorithms, 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[2] Amith R. Mamidala et al. Fast and scalable MPI-level broadcast using InfiniBand's hardware multicast support, 2004, 18th International Parallel and Distributed Processing Symposium (IPDPS 2004).
[3] Dhabaleswar K. Panda et al. Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters, 2016, 2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD).
[4] Satoshi Matsuoka et al. High-Performance MPI Broadcast Algorithm for Grid Environments Utilizing Multi-lane NICs, 2007, Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid '07).
[5] Michael S. Bernstein et al. ImageNet Large Scale Visual Recognition Challenge, 2014, International Journal of Computer Vision.
[6] Jack J. Dongarra et al. Performance analysis of MPI collective operations, 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[7] Jungwon Kim et al. Accelerating LINPACK with MPI-OpenCL on Clusters of Multi-GPU Nodes, 2015, IEEE Transactions on Parallel and Distributed Systems.
[8] Dhabaleswar K. Panda et al. Efficient and truly passive MPI-3 RMA using InfiniBand atomics, 2013, EuroMPI.
[9] Torsten Hoefler et al. Enabling highly-scalable remote memory access programming with MPI-3 one sided, 2013, SC '13: International Conference for High Performance Computing, Networking, Storage and Analysis.
[10] Dhabaleswar K. Panda et al. Efficient Large Message Broadcast using NCCL and CUDA-Aware MPI for Deep Learning, 2016, EuroMPI.
[11] Scott B. Baden et al. Effective multi-GPU communication using multiple CUDA streams and threads, 2014, 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS).
[12] Dhabaleswar K. Panda et al. Optimizing MPI Communication on Multi-GPU Systems Using CUDA Inter-Process Communication, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[13] Erwin Laure et al. A data streaming model in MPI, 2015, ExaMPI '15.
[14] Jian-Ming Jin et al. Acceleration of the Dual-Field Domain Decomposition Algorithm Using MPI–CUDA on Large-Scale Computing Systems, 2014, IEEE Transactions on Antennas and Propagation.
[15] Dhabaleswar K. Panda et al. Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters, 2014, 2014 21st International Conference on High Performance Computing (HiPC).
[16] Dhabaleswar K. Panda et al. A high performance broadcast design with hardware multicast and GPUDirect RDMA for streaming applications on InfiniBand clusters, 2014, 2014 21st International Conference on High Performance Computing (HiPC).
[17] Dhabaleswar K. Panda et al. Efficient and Scalable Multi-Source Streaming Broadcast on GPU Clusters for Deep Learning, 2017, 2017 46th International Conference on Parallel Processing (ICPP).
[18] Dhabaleswar K. Panda et al. Optimized Broadcast for Deep Learning Workloads on Dense-GPU InfiniBand Clusters: MPI or NCCL?, 2017, EuroMPI.
[19] Rafael Asenjo et al. Mapping Streaming Applications on Commodity Multi-CPU and GPU On-Chip Processors, 2016, IEEE Transactions on Parallel and Distributed Systems.
[20] Dhabaleswar K. Panda et al. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, 2013, 2013 42nd International Conference on Parallel Processing.
[21] Amith R. Mamidala et al. Efficient SMP-aware MPI-level broadcast over InfiniBand's hardware multicast, 2006, Proceedings of the 20th IEEE International Parallel & Distributed Processing Symposium.
[22] Jiazheng Zhou et al. Hardware supported multicast in fat-tree-based InfiniBand networks, 2007, The Journal of Supercomputing.
[23] Dhabaleswar K. Panda et al. Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters, 2016, 2016 IEEE International Conference on Cloud Computing Technology and Science (CloudCom).
[24] Alexey L. Lastovetsky et al. High-Level Topology-Oblivious Optimization of MPI Broadcast Algorithms on Extreme-Scale Platforms, 2014, Euro-Par Workshops.
[25] Jian Sun et al. Deep Residual Learning for Image Recognition, 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] Umakishore Ramachandran et al. Streamline: a scheduling heuristic for streaming applications on the grid, 2006, Electronic Imaging.
[27] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, 2016, ArXiv.
[28] Vladimir Marjanovic et al. A Bandwidth-Saving Optimization for MPI Broadcast Collective Operation, 2015, 2015 44th International Conference on Parallel Processing Workshops.
[29] Torsten Hoefler et al. A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast, 2007, 2007 IEEE International Parallel and Distributed Processing Symposium.
[30] Dhabaleswar K. Panda et al. OMB-GPU: A Micro-Benchmark Suite for Evaluating MPI Libraries on GPU Clusters, 2012, EuroMPI.
[31] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks, 2012, Communications of the ACM.
[32] Dhabaleswar K. Panda et al. S-Caffe: Co-designing MPI Runtimes and Caffe for Scalable Deep Learning on Modern GPU Clusters, 2017, PPoPP.
[33] Dhabaleswar K. Panda et al. Efficient Reliability Support for Hardware Multicast-Based Broadcast in GPU-enabled Streaming Applications, 2016, 2016 First International Workshop on Communication Optimizations in HPC (COMHPC).
[34] Andrew Zisserman et al. Very Deep Convolutional Networks for Large-Scale Image Recognition, 2014, ICLR.