HAND: A Hybrid Approach to Accelerate Non-contiguous Data Movement Using MPI Datatypes on GPU Clusters
[1] Orion S. Lawlor, et al. Message passing for GPGPU clusters: CudaMPI, 2009, 2009 IEEE International Conference on Cluster Computing and Workshops.
[2] Torsten Hoefler, et al. Performance Expectations and Guidelines for MPI Derived Datatypes, 2011, EuroMPI.
[3] Dhabaleswar K. Panda, et al. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, 2013, 2013 42nd International Conference on Parallel Processing.
[4] Collin McCurdy, et al. The Scalable Heterogeneous Computing (SHOC) benchmark suite, 2010, GPGPU-3.
[5] Sayantan Sur, et al. Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2, 2011, 2011 IEEE International Conference on Cluster Computing.
[6] Nagiza F. Samatova, et al. Processing MPI Derived Datatypes on Noncontiguous GPU-Resident Data, 2014, IEEE Transactions on Parallel and Distributed Systems.
[7] Dhabaleswar K. Panda, et al. A scalable and portable approach to accelerate hybrid HPL on heterogeneous CPU-GPU clusters, 2013, 2013 IEEE International Conference on Cluster Computing (CLUSTER).
[8] Dhabaleswar K. Panda, et al. GPU-Aware MPI on RDMA-Enabled Clusters: Design, Implementation and Evaluation, 2014, IEEE Transactions on Parallel and Distributed Systems.
[9] Torsten Hoefler, et al. Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient Using MPI Datatypes, 2010, EuroMPI.
[10] Satoshi Matsuoka, et al. High performance 3-D FFT using multiple CUDA GPUs, 2012, GPGPU-5.
[11] Sayantan Sur, et al. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters, 2011, Computer Science - Research and Development.
[12] Robert B. Ross, et al. Implementing Fast and Reusable Datatype Processing, 2003, PVM/MPI.
[13] Torsten Hoefler, et al. MPI datatype processing using runtime compilation, 2013, EuroMPI.
[14] Satoshi Matsuoka, et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers, 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[15] Torsten Hoefler, et al. Micro-applications for Communication Data Access Patterns and MPI Datatypes, 2012, EuroMPI.
[16] Mauro Bianco, et al. A Generic Library for Stencil Computations, 2012, ArXiv.
[17] Mark J. Harris, et al. Parallel Prefix Sum (Scan) with CUDA, 2011.