Performance Tuning and Optimization Techniques of Fixed and Variable Size Batched Cholesky Factorization on GPUs

Solving a large number of relatively small linear systems has recently drawn more attention in the HPC community, due to the importance of such computational workloads in many scientific applications, including sparse multifrontal solvers. Modern hardware accelerators and their architectures require a set of optimization techniques that are very different from the ones used in solving one relatively large matrix. In order to impose concurrency on such throughput-oriented architectures, a common practice is to batch the solution of these matrices as one task offloaded to the underlying hardware, rather than solving them individually. This paper presents a high performance batched Cholesky factorization on large sets of relatively small matrices using Graphics Processing Units (GPUs), and addresses both fixed and variable size batched problems. We investigate various algorithm designs and optimization techniques, and show that it is essential to combine kernel design with performance tuning in order to achieve the best possible performance. We compare our approaches against state-of-the-art CPU solutions as well as GPU-based solutions using existing libraries, and show that, on a K40c GPU for example, our kernels are more than 2× faster.
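To illustrate the distinction between fixed and variable size batched problems that the abstract draws, here is a minimal CPU-side sketch in NumPy. This is not the paper's GPU implementation; it only shows why the two cases differ: same-size matrices can be stacked into one array and factorized in a single batched call, while variable-size matrices cannot be stacked and must each be handled separately (on a GPU, typically by mapping each matrix to its own thread block).

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed-size batch: all matrices share dimension n, so they stack into one
# (batch, n, n) array and factorize in a single vectorized call.
batch, n = 4, 8
A = rng.standard_normal((batch, n, n))
spd = A @ A.transpose(0, 2, 1) + n * np.eye(n)  # make each matrix SPD
L_fixed = np.linalg.cholesky(spd)               # one batched factorization

# Variable-size batch: differing dimensions prevent stacking, so each
# matrix is factorized on its own.
sizes = [4, 6, 9]
mats = []
for m in sizes:
    B = rng.standard_normal((m, m))
    mats.append(B @ B.T + m * np.eye(m))        # SPD matrix of size m
L_var = [np.linalg.cholesky(M) for M in mats]

# Verify: L @ L^T reconstructs each original matrix.
assert np.allclose(L_fixed @ L_fixed.transpose(0, 2, 1), spd)
assert all(np.allclose(L @ L.T, M) for L, M in zip(L_var, mats))
```

For actual GPU execution, fixed-size batched Cholesky routines are provided by, e.g., cuSOLVER (`cusolverDn<t>potrfBatched`) and MAGMA's batched kernels; the variable-size case is the harder one that the paper's kernel designs target.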
