Multifrontal Factorization of Sparse SPD Matrices on GPUs

Solving large sparse linear systems is often the most computationally intensive component of many scientific computing applications. Sparse multifrontal direct factorization has been shown to scale to thousands of processors on dedicated supercomputers, yielding substantial reductions in computation time. In recent years, an alternative computing paradigm based on GPUs has gained prominence, primarily due to its affordability, power efficiency, and potential for significant speedups over desktop performance on regular, structured parallel applications. However, sparse matrix factorization on GPUs has not been explored sufficiently, owing to the complexity of an efficient implementation and concerns about low GPU utilization. In this paper, we present an adaptive hybrid approach for accelerating sparse multifrontal factorization that judiciously exploits the processing power of both the host CPU and the GPU. We present four policies for distributing and scheduling the workload between the host CPU and the GPU, and we propose a mechanism for runtime selection of the appropriate policy at each step of sparse Cholesky factorization. This mechanism relies on auto-tuning: the best-policy predictor is modeled as a parametric classifier whose parameters are estimated from empirical computation-time data so that the expected computation time is minimized. The approach readily adapts to the current or an extended set of policies, to different CPU-GPU combinations, and to different combinations of dense kernels on both the CPU and the GPU.
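
Since the abstract does not spell out the four policies or the form of the classifier, the following Python sketch is purely illustrative of the runtime-selection idea: the policy names, the two per-front features, and the linear per-policy timing model are all assumptions, not the authors' implementation.

```python
# Minimal sketch of classifier-based policy selection for one frontal
# matrix. Policy labels, features, and the linear model are hypothetical.
import numpy as np

POLICIES = ["cpu_only", "gpu_only", "hybrid_split", "hybrid_pipelined"]

def select_policy(features: np.ndarray, W: np.ndarray, b: np.ndarray) -> str:
    """Pick the policy whose predicted execution time is smallest.

    features : per-front descriptors, e.g. [front_order, num_pivots]
    W, b     : parameters fit offline on empirical timing data so that
               the expected factorization time is minimized
    """
    predicted_times = W @ features + b   # one predicted time per policy
    return POLICIES[int(np.argmin(predicted_times))]

# Toy usage: a 2000x2000 front from which 500 pivots are eliminated.
rng = np.random.default_rng(0)
W = rng.random((len(POLICIES), 2))       # stand-in for tuned parameters
b = rng.random(len(POLICIES))
print(select_policy(np.array([2000.0, 500.0]), W, b))
```

In a real factorization this selection would run once per supernode/front during the tree traversal, so the predictor must be cheap to evaluate relative to the dense kernels it dispatches.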
