Parallelizing the QUDA Library for Multi-GPU Calculations in Lattice Quantum Chromodynamics

Graphics Processing Units (GPUs) are having a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations of importance in nuclear and particle physics. The QUDA library provides a package of mixed-precision sparse matrix linear solvers for LQCD applications, supporting single GPUs based on NVIDIA's Compute Unified Device Architecture (CUDA). This library, interfaced to the QDP++/Chroma framework for LQCD calculations, is currently in production use on the "9g" cluster at the Jefferson Laboratory, enabling unprecedented price/performance for a range of problems in LQCD. Nevertheless, memory constraints on current GPU devices limit the problem sizes that can be tackled. In this contribution we describe the parallelization of the QUDA library onto multiple GPUs using MPI, including strategies for overlapping communication and computation. We report on both weak and strong scaling for up to 32 GPUs interconnected by InfiniBand, on which we sustain in excess of 4 Tflops.
