Enabling Efficient Use of MPI and PGAS Programming Models on Heterogeneous Clusters with High Performance Interconnects
[1] J.C. Sancho,et al. Quantifying the Potential Benefit of Overlapping Communication and Computation in Large-Scale Scientific Applications , 2006, ACM/IEEE SC 2006 Conference (SC'06).
[2] Dhabaleswar K. Panda,et al. Design of a scalable InfiniBand topology service to enable network-topology-aware placement of processes , 2012, 2012 International Conference for High Performance Computing, Networking, Storage and Analysis.
[3] Jing Zhu,et al. Enabling Very-Large Scale Earthquake Simulations on Parallel Machines , 2007, International Conference on Computational Science.
[4] Amith R. Mamidala,et al. Lock-Free Asynchronous Rendezvous Design for MPI Point-to-Point Communication , 2008, PVM/MPI.
[5] Bernd Mohr,et al. Scalable detection of MPI-2 remote memory access inefficiency patterns , 2009, Int. J. High Perform. Comput. Appl..
[6] K. Milfeld,et al. Early experiences with the intel many integrated cores accelerated computing technology , 2011 .
[7] Sayantan Sur,et al. Designing truly one-sided MPI-2 RMA intra-node communication on multi-core systems , 2010, Computer Science - Research and Development.
[8] Sayantan Sur,et al. Lightweight kernel-level primitives for high-performance MPI intra-node communication over multi-core systems , 2007, 2007 IEEE International Conference on Cluster Computing.
[9] Sayantan Sur,et al. Quantifying performance benefits of overlap using MPI-2 in a seismic modeling application , 2010, ICS '10.
[10] Sayantan Sur,et al. Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures , 2011, 2011 18th International Conference on High Performance Computing.
[11] Jing Zhu,et al. Toward petascale earthquake simulations , 2009 .
[12] John D. Owens,et al. Message passing on data-parallel architectures , 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.
[13] Sriram Krishnamoorthy,et al. Acceleration of Streamed Tensor Contraction Expressions on GPGPU-Based Clusters , 2010, 2010 IEEE International Conference on Cluster Computing.
[14] Satoshi Matsuoka,et al. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[15] Richard W. Vuduc,et al. Model-driven autotuning of sparse matrix-vector multiply on GPUs , 2010, PPoPP '10.
[16] Lei Huang,et al. Unified Parallel C for GPU Clusters: Language Extensions and Compiler Implementation , 2010, LCPC.
[17] Dhabaleswar K. Panda,et al. Design and implementation of MPICH2 over InfiniBand with RDMA support , 2003, 18th International Parallel and Distributed Processing Symposium, 2004. Proceedings..
[18] Feng Ji,et al. Efficient Intranode Communication in GPU-Accelerated Systems , 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.
[19] Mitsuhisa Sato,et al. An Extension of XcalableMP PGAS Language for Multi-node GPU Clusters , 2011, Euro-Par Workshops.
[20] Inanc Senocak,et al. An MPI-CUDA Implementation for Massively Parallel Incompressible Flow Computations on Multi-GPU Clusters , 2010 .
[21] Dhabaleswar K. Panda,et al. Impact of on-demand connection management in MPI over VIA , 2002, Proceedings. IEEE International Conference on Cluster Computing.
[22] Hyun-Wook Jin,et al. High performance MPI-2 one-sided communication over InfiniBand , 2004, IEEE International Symposium on Cluster Computing and the Grid, 2004. CCGrid 2004..
[23] Hyun-Wook Jin,et al. Scheduling of MPI-2 one sided operations over InfiniBand , 2005, 19th IEEE International Parallel and Distributed Processing Symposium.
[24] Yutaka Ishikawa,et al. Abstract: An MPI Library implementing Direct Communication for Many-Core Based Accelerators , 2012, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis.
[25] Satoshi Matsuoka,et al. Physis: An implicitly parallel programming model for stencil computations on large-scale GPU-accelerated supercomputers , 2011, 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[26] Nagiza F. Samatova,et al. Enabling Fast, Noncontiguous GPU Data Movement in Hybrid MPI+GPU Environments , 2012, 2012 IEEE International Conference on Cluster Computing.
[27] Carlos Rosales,et al. Multiphase LBM Distributed over Multiple GPUs , 2011, 2011 IEEE International Conference on Cluster Computing.
[28] Sayantan Sur,et al. RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits , 2006, PPoPP '06.
[29] Dmitry Pekurovsky,et al. P3DFFT: A Framework for Parallel Computations of Fourier Transforms in Three Dimensions , 2012, SIAM J. Sci. Comput..
[30] Federico Silla,et al. Enabling CUDA acceleration within virtual machines using rCUDA , 2011, 2011 18th International Conference on High Performance Computing.
[31] Sayantan Sur,et al. Optimizing MPI One Sided Communication on Multi-core InfiniBand Clusters Using Shared Memory Backed Windows , 2011, EuroMPI.
[32] Dhabaleswar K. Panda,et al. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs , 2013, 2013 42nd International Conference on Parallel Processing.
[33] Charles L. Seitz,et al. Myrinet: A Gigabit-per-Second Local Area Network , 1995, IEEE Micro.
[34] Larry Meadows,et al. Experiments with WRF on Intel® Many Integrated Core (Intel MIC) Architecture , 2012, IWOMP.
[35] John E. Stone,et al. An asymmetric distributed shared memory model for heterogeneous parallel systems , 2010, ASPLOS XV.
[36] K. Gopalakrishnan,et al. Natively Supporting True One-Sided Communication in MPI on Multi-core Systems with InfiniBand , 2009, 2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid.
[37] S. Rixner,et al. An Event-driven Architecture for MPI Libraries , 2004 .
[38] Sayantan Sur,et al. MPI Alltoall Personalized Exchange on GPGPU Clusters: Design Alternatives and Benefit , 2011, 2011 IEEE International Conference on Cluster Computing.
[39] Scott B. Baden,et al. Mint: realizing CUDA performance in 3D stencil methods with annotated C , 2011, ICS '11.
[40] Dhabaleswar K. Panda,et al. Designing passive synchronization for MPI-2 one-sided communication to maximize overlap , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[41] Dhabaleswar K. Panda,et al. Scalable Earthquake Simulation on Petascale Supercomputers , 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.
[42] Wu-chun Feng,et al. VOCL: An optimized environment for transparent virtualization of graphics processing units , 2012, 2012 Innovative Parallel Computing (InPar).
[43] Raymond Namyst,et al. A multithreaded communication engine for multicore architectures , 2008, 2008 IEEE International Symposium on Parallel and Distributed Processing.
[44] Keith D. Underwood,et al. An analysis of the impact of MPI overlap and independent progress , 2004, ICS '04.
[45] Feng Qiu,et al. Zippy: A Framework for Computation and Visualization on a GPU Cluster , 2008, Comput. Graph. Forum.
[46] Katherine A. Yelick,et al. Optimizing bandwidth limited problems using one-sided communication and overlap , 2005, Proceedings 20th IEEE International Parallel & Distributed Processing Symposium.
[47] Reagan Moore,et al. Optimization and Scalability of a Large-scale Earthquake Simulation Application , 2006 .
[48] Pradeep Dubey,et al. Designing and dynamically load balancing hybrid LU for multi/many-core , 2011, Computer Science - Research and Development.
[49] Guillaume Mercier,et al. Cache-Efficient, Intranode, Large-Message MPI Communication with MPICH2-Nemesis , 2009, 2009 International Conference on Parallel Processing.
[50] Sabela Ramos,et al. Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi , 2013, HPDC.
[51] P. Maechling,et al. Strong shaking in Los Angeles expected from southern San Andreas earthquake , 2006 .
[52] Massimiliano Fatica,et al. Implementing the Himeno benchmark with CUDA on GPU clusters , 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS).
[53] Feng Ji,et al. DMA-Assisted, Intranode Communication in GPU Accelerated Systems , 2012, 2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems.
[54] Torsten Hoefler,et al. Enabling highly-scalable remote memory access programming with MPI-3 one sided , 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).
[55] Dhabaleswar K. Panda,et al. The MVAPICH Project: Evolution and Sustainability of an Open Source Production Quality MPI Library for HPC , 2013 .
[56] Jeffrey S. Vetter,et al. Quantifying NUMA and contention effects in multi-GPU systems , 2011, GPGPU-4.
[57] Arthur A. Mirin,et al. A Scalable Implementation of a Finite-Volume Dynamical Core in the Community Atmosphere Model , 2005, Int. J. High Perform. Comput. Appl..