Advanced Application Support for Improved GPU Utilization on Keeneland

With the delivery of the Keeneland Full Scale (KFS) system in 2012, XSEDE gained a unique GPU computing resource with an unusually large number of GPUs per node. Each KFS node contains three NVIDIA Fermi GPUs, for a total of 792 GPUs and a theoretical peak of 614.5 TFLOPS across 264 nodes. While this system offers the potential for extreme productivity, its architecture also requires that each user exploit all of the GPU resources on every allocated node to achieve the best performance. Previous work [12] demonstrated a tool for tracking the GPU utilization of individual nodes and of the system as a whole, and that tool helped pinpoint low GPU utilization on KFS and its precursor, the Keeneland Initial Delivery System (KIDS). This work discusses experiences, strategies, and results from application support on the Keeneland Full Scale system aimed at ensuring that users fully utilize GPU resources, improving the performance of their calculations while reducing Service Unit (SU) consumption. In many cases, these strategies come down to two factors: user education and code optimization for KFS's three-GPUs-per-node architecture. Three applications from the molecular science, materials science, and chemistry domains are discussed in this context, and recent application support results illustrate how small interventions can greatly increase utilization on a month-to-month basis.
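The utilization-tracking tool of [12] is described in detail elsewhere; as a rough illustration of the idea, the sketch below samples per-GPU utilization counters through NVIDIA's NVML library, the same interface exposed by nvidia-smi. This is a minimal, hypothetical example, not the Keeneland tool itself, and error handling is omitted after initialization for brevity.

    #include <stdio.h>
    #include <nvml.h>

    /* Minimal NVML sketch: sample the SM and memory utilization of every
       GPU on the local node, as a node-level monitor might do. */
    int main(void)
    {
        unsigned int i, count;

        if (nvmlInit() != NVML_SUCCESS)
            return 1;
        nvmlDeviceGetCount(&count);
        for (i = 0; i < count; ++i) {
            nvmlDevice_t dev;
            nvmlUtilization_t util;

            nvmlDeviceGetHandleByIndex(i, &dev);
            nvmlDeviceGetUtilizationRates(dev, &util);
            printf("GPU %u: %u%% SM, %u%% memory\n", i, util.gpu, util.memory);
        }
        nvmlShutdown();
        return 0;
    }

A node-level monitor would run such a sampler periodically and aggregate the readings across nodes to produce the system-wide utilization figures discussed above.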
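On the application side, the most common intervention is simply to drive all three GPUs on every allocated node, typically by binding one MPI rank to each device. The following sketch shows that idiom using the MPI-3 shared-memory communicator split to derive a node-local rank; this is an assumed, simplified formulation (production codes of the KFS era often read a launcher-specific local-rank environment variable instead).

    #include <stdio.h>
    #include <mpi.h>
    #include <cuda_runtime.h>

    /* Sketch: map each MPI rank on a node to a distinct GPU so that all
       three GPUs per KFS node are driven concurrently. */
    int main(int argc, char **argv)
    {
        int rank, local_rank, ngpus;
        MPI_Comm node_comm;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Split off a communicator containing only the ranks sharing this
           node; the rank within it serves as a node-local index. */
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node_comm);
        MPI_Comm_rank(node_comm, &local_rank);

        cudaGetDeviceCount(&ngpus);
        cudaSetDevice(local_rank % ngpus);   /* one rank per GPU */
        printf("rank %d bound to GPU %d of %d\n",
               rank, local_rank % ngpus, ngpus);

        MPI_Comm_free(&node_comm);
        MPI_Finalize();
        return 0;
    }

Launched with three ranks per node, this mapping keeps every GPU busy rather than leaving two of the three idle.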

References

[1] Kipton Barros et al. Solving lattice QCD systems of equations using mixed precision solvers on GPUs. Computer Physics Communications, 2009.

[2] Laxmikant V. Kalé et al. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 2005.

[3] Tong Liu et al. The development of Mellanox/NVIDIA GPUDirect over InfiniBand: a new model for GPU to GPU communications. Computer Science - Research and Development, 2011.

[4] Laxmikant V. Kalé et al. CHARM++: a portable concurrent object oriented system based on C++. In Proceedings of OOPSLA '93, 1993.

[5] Karsten Schwan et al. Keeneland: Bringing Heterogeneous GPU Computing to the Computational Science Community. Computing in Science & Engineering, 2011.

[6] Joshua A. Anderson et al. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics, 2008.

[7] Sharon C. Glotzer et al. HOOMD-blue, general-purpose many-body dynamics on the GPU, 2010.

[8] Allen D. Malony et al. The TAU Parallel Performance System. International Journal of High Performance Computing Applications, 2006.

[9] Bálint Joó. SciDAC-2 software infrastructure for lattice QCD, 2007.

[10] A. Rambaut et al. BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evolutionary Biology, 2007.

[11] Daniel L. Ayres et al. BEAGLE: An Application Programming Interface and High-Performance Computing Library for Statistical Phylogenetics. Systematic Biology, 2011.

[12] Steven A. Gottlieb et al. Scaling lattice QCD beyond 100 GPUs. In Proceedings of the 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC '11), 2011.

[13] Dhabaleswar K. Panda et al. Efficient Inter-node MPI Communication Using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs. In Proceedings of the 42nd International Conference on Parallel Processing (ICPP), 2013.

[14] Stephen McNally et al. An analysis of GPU utilization trends on the Keeneland initial delivery system. In Proceedings of XSEDE '12, 2012.