Implementing the Dslash Operator in OpenCL

The Dslash operator is used in Lattice Quantum Chromodynamics (LQCD) applications to implement a Wilson-Dirac sparse matrix-vector product. The Dslash operation has typically been implemented as a parallel program. Today’s Graphics Processing Units (GPUs) are designed to perform highly parallel numerical calculations for 3D graphics rendering, a design that also suits scientific applications such as LQCD’s implementation of the Dslash operator. The Scientific Computing group at the Thomas Jefferson National Accelerator Facility (Jefferson Lab) has implemented the Dslash operator for execution on GPUs using NVIDIA’s Compute Unified Device Architecture (CUDA). CUDA applications, however, run only on NVIDIA hardware. OpenCL (Open Computing Language) is a new open standard for developing parallel programs across CPUs, GPUs, and other processors. This paper describes an implementation of the Dslash operator in OpenCL, its performance on NVIDIA GPUs compared with CUDA, and its performance on other hardware platforms.

General Terms: Performance, Languages.
