QCD Library for GPU Cluster with Proprietary Interconnect for GPU Direct Communication

QUDA is a Lattice QCD library that runs on NVIDIA Graphics Processing Unit (GPU) accelerators and is widely used as a framework for Lattice QCD applications. In this paper, we apply our proprietary interconnect, the Tightly Coupled Accelerators (TCA) architecture, to inter-node GPU communication in QUDA. The TCA architecture was developed for low-latency inter-node communication among accelerators connected through the PCI Express (PCIe) bus on PC clusters. It enables direct memory copy between accelerators, such as GPUs, across nodes in the same manner as an intra-node PCIe transaction. We assess the performance of TCA with QUDA on HA-PACS/TCA, a high-density GPU cluster that serves as a proof-of-concept testbed for the TCA architecture. The results show that our interconnection network significantly reduces communication latency and achieves better strong scaling than conventional InfiniBand solutions on PC clusters with GPUs. The execution time for the Conjugate Gradient (CG) iteration shows that the TCA implementation is 2.14 times faster than the peer-to-peer MPI implementation and 1.96 times faster than the MPI remote memory access (RMA) implementation, where both MPI implementations use a dual-rail InfiniBand QDR network.
