Paper information - A method for communication efficient work distributions in stencil operation based applications on heterogeneous clusters

In recent years, heterogeneous computing, the use of accelerators in conjunction with CPUs, has brought significant performance increases to scientific applications. One of the best examples is Lattice Quantum Chromodynamics (QCD), a simulation based on stencil operations. These simulations have a large memory footprint, which necessitates the use of many graphics processing units (GPUs) in parallel and therefore a heterogeneous cluster with one or more GPUs per node. To obtain optimal performance, it is necessary to determine an efficient communication pattern between GPUs on the same node and between nodes. In this paper we present a method, based on a performance model, for minimizing the communication time of applications with stencil operations, such as Lattice QCD, on heterogeneous computing systems with a non-blocking InfiniBand interconnection network. The proposed method increases the performance of the most computationally intensive kernel of Lattice QCD by 25 percent through improved overlapping of communication and computation.
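The overlap of communication and computation mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's method or performance model. It assumes a periodic one-dimensional three-point stencil, and the names `stencil_global`, `stencil_split`, and `step_time` are hypothetical. The first part shows why a subdomain's interior sites can be updated while its halo is still in transit. The second part is a generic textbook estimate of the time per step with and without overlap.

```python
def stencil_global(u):
    """Periodic 3-point stencil: out[i] = u[i-1] + u[i+1] - 2*u[i]."""
    n = len(u)
    return [u[(i - 1) % n] + u[(i + 1) % n] - 2 * u[i] for i in range(n)]


def stencil_split(u, parts):
    """Same stencil, computed per subdomain in an interior and a boundary phase.

    Assumes parts divides len(u) evenly and each subdomain has at least 2 sites.
    """
    n = len(u)
    size = n // parts
    out = [0] * n
    for p in range(parts):
        lo, hi = p * size, (p + 1) * size
        local = u[lo:hi]
        # Halo values that a real code would receive from neighbouring devices.
        left_halo, right_halo = u[(lo - 1) % n], u[hi % n]
        # Interior sites need only local data, so they could run during the transfer.
        for i in range(1, size - 1):
            out[lo + i] = local[i - 1] + local[i + 1] - 2 * local[i]
        # Boundary sites have to wait until the halo has arrived.
        out[lo] = left_halo + local[1] - 2 * local[0]
        out[hi - 1] = local[size - 2] + right_halo - 2 * local[size - 1]
    return out


def step_time(t_comm, t_interior, t_boundary, overlap=True):
    """Toy estimate of the time per step (an assumption, not the paper's model)."""
    if overlap:
        # The halo transfer is hidden behind interior work whenever it is shorter.
        return max(t_comm, t_interior) + t_boundary
    return t_comm + t_interior + t_boundary


u = [1, 4, 9, 16, 25, 36, 49, 64]
print(stencil_split(u, 2) == stencil_global(u))
print(step_time(3.0, 5.0, 1.0), step_time(3.0, 5.0, 1.0, overlap=False))
```

In this toy estimate the transfer costs nothing extra as long as `t_comm` does not exceed `t_interior`. Once communication takes longer than the interior update, it determines the time per step, which is the situation a communication-efficient work distribution tries to avoid.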

Tarek A. El-Ghazawi | Maria Malik | Joseph Schneible | Lubomir Riha | Andrei Alexandru