Implementation of CG Method on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication

We have been developing a proprietary interconnect technology called the Tightly Coupled Accelerators (TCA) architecture to improve communication latency and bandwidth between compute nodes of a GPU cluster. This paper describes an implementation of the Conjugate Gradient (CG) method using TCA and presents performance results on the HA-PACS/TCA system, a proof-of-concept GPU cluster based on the TCA concept. The implementation uses TCA for the allgather and allreduce collective communications. A comparison between the TCA-based implementation and an MPI-based implementation shows that TCA reduces latency for relatively small message sizes in the allgather and achieves roughly twice the speed in the allreduce. As a result, the CG implementation using TCA outperforms the MPI implementation for sparse matrices whose dimensions range from the thousands to the tens of thousands.
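To make the role of the two collectives concrete, the following is a minimal serial CG sketch in NumPy (a hypothetical illustration, not the paper's TCA implementation), with comments marking where a distributed version would invoke the allreduce (for the global dot products) and the allgather (to assemble the vector needed by the sparse matrix-vector product):

```python
import numpy as np

def cg(A, b, tol=1e-10, max_iter=1000):
    """Conjugate Gradient for a symmetric positive-definite system Ax = b."""
    x = np.zeros_like(b)
    r = b - A @ x            # distributed SpMV: would need an allgather of x
    p = r.copy()
    rs_old = r @ r           # dot product: local partial sum + allreduce
    for _ in range(max_iter):
        Ap = A @ p           # allgather of p would precede this SpMV
        alpha = rs_old / (p @ Ap)   # p.Ap dot product: another allreduce
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r       # residual norm: one more allreduce per iteration
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Small SPD example system
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = cg(A, b)
```

Because every iteration performs two or three small allreduce operations on scalars, their latency rather than bandwidth dominates for the matrix sizes discussed above, which is where a low-latency interconnect such as TCA pays off.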