Effective Minimally-Invasive GPU Acceleration of Distributed Sparse Matrix Factorization

Sparse matrix factorization, a critical algorithm in many science and engineering applications, has had difficulty leveraging the additional computational power afforded by the infusion of heterogeneous accelerators in HPC clusters. We present a minimally invasive approach to the GPU acceleration of a hybrid multifrontal solver, the Watson Sparse Matrix Package, which is already highly optimized for the CPU and exhibits leading performance on distributed architectures. The novel aspect of this work is to demonstrate techniques for achieving substantial GPU acceleration, up to 3.5x, of the sparse factorization with strategic but contained changes to the original CPU-only code. Strong scaling results show that the performance benefits scale to as many as 512 nodes (4,096 cores) of the Blue Waters supercomputer at NCSA. The techniques presented here suggest that detailed code reorganization may not be necessary to achieve substantial acceleration from GPUs, even for complex algorithms with highly irregular compute and data access patterns, like those used for distributed sparse factorization.
