Tuning And Understanding MILC Performance In Cray XK6 GPU Clusters

Graphics Processing Units (GPUs) are becoming increasingly popular in high performance computing due to their high performance, high power efficiency, and low cost. Lattice QCD is one of the fields that has successfully adopted GPUs and scaled to hundreds of them. In this paper, we report our experience profiling and understanding the performance of MILC, one of the Lattice QCD computation packages, on multi-node Cray XK6 systems using a domain-specific GPU library called QUDA. QUDA is a library for accelerating Lattice QCD computations on GPUs. It started at Boston University and has evolved into a multi-institution project. It supports multiple quark actions and has been interfaced to many applications, including MILC and Chroma. The most time-consuming part of a Lattice QCD computation is the sparse matrix solver, and QUDA provides efficient Conjugate Gradient (CG) and other solvers. By partitioning the lattice in the 4-D space-time domain, the solvers in the QUDA library enable applications to scale to hundreds of GPUs with high efficiency. The other computationally intensive components, such as the link fattening, gauge force, and fermion force computations, are also being ported to GPUs.
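To make the 4-D partitioning concrete, the following is a minimal Python sketch of how a global lattice might be divided evenly over a grid of GPUs, and how many boundary sites each GPU would then have to exchange with its neighbors. It is an illustration only and does not reproduce QUDA's or MILC's actual code: the function names `local_extent` and `halo_sites`, the `depth` parameter, and the lattice and grid sizes in the usage lines are all assumptions chosen for the example.

```python
from math import prod


def local_extent(global_dims, gpu_grid):
    """Return the per-GPU sub-lattice size for an even 4-D split.

    global_dims and gpu_grid are hypothetical (x, y, z, t) tuples.
    """
    if len(global_dims) != 4 or len(gpu_grid) != 4:
        raise ValueError("expected a 4-D lattice and a 4-D GPU grid")
    for n, p in zip(global_dims, gpu_grid):
        if p < 1 or n % p != 0:
            raise ValueError(f"extent {n} is not divisible by grid size {p}")
    return tuple(n // p for n, p in zip(global_dims, gpu_grid))


def halo_sites(local_dims, gpu_grid, depth=1):
    """Count boundary sites one GPU exchanges per operator application.

    Only partitioned dimensions (grid size > 1) contribute. Each one has
    two faces, and each face is `depth` sites thick. `depth` is an
    assumed stencil reach and should not exceed the local extent.
    """
    volume = prod(local_dims)
    total = 0
    for n, p in zip(local_dims, gpu_grid):
        if p > 1:
            face = volume // n  # sites in one 3-D face of the sub-lattice
            total += 2 * depth * face
    return total


# Hypothetical example: a 24^3 x 64 lattice on a 1 x 1 x 2 x 4 GPU grid.
local = local_extent((24, 24, 24, 64), (1, 1, 2, 4))
halo = halo_sites(local, (1, 1, 2, 4))
print(local, halo, halo / prod(local))
```

The ratio of halo sites to local volume printed at the end is a rough proxy for communication cost relative to computation. It grows as the sub-lattice shrinks, which is one reason the choice of partitioned dimensions matters for scaling.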
