Autotuning Numerical Dense Linear Algebra for Batched Computation With GPU Hardware Accelerators

Computational problems in engineering and scientific disciplines often rely on the solution of many instances of small systems of linear equations, commonly referred to as batched solves. In this paper, we focus on batched Cholesky factorization and the subsequent forward and backward substitutions; the factorization requires the linear system matrices to be symmetric positive definite (SPD). We describe the implementation and automated performance engineering of the kernels that carry out the factorization and the two substitutions. Our target platforms are graphics processing units (GPUs), which over the past decade have become an attractive high-performance computing (HPC) target for linear system solvers. Owing to their throughput-oriented design, GPUs deliver the highest processing rates among available processors; without careful design and coding, however, this speed is realized mostly for large matrix sizes. We present an automated exploration of the implementation space as well as a new data layout for this batched class of SPD solvers. Our tests involve the solution of many thousands of linear SPD systems of identical size, with the individual matrices in the batch ranging in dimension from 5-by-5 up to 100-by-100. We compare our autotuned solvers against state-of-the-art solvers, including those provided by NVIDIA and those publicly available in the optimized MAGMA library. The observed performance is competitive, and superior in many practical cases. The advantage of the presented methodology lies in achieving these results in a portable manner across matrix storage formats and GPU hardware architectures.
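To make the batched-layout idea concrete: an interleaved layout stores element (i, j) of every matrix in the batch contiguously, so that parallel workers handling different matrices touch consecutive memory (the coalescing-friendly access pattern on a GPU). The following is a minimal pure-Python sketch of batched Cholesky factorization and forward substitution over such a layout; the function names and the flat indexing scheme are illustrative assumptions, not the paper's actual implementation.

```python
import math

def batched_cholesky_interleaved(a, n, batch):
    """In-place lower-triangular Cholesky of `batch` n-by-n SPD matrices.

    `a` is a flat list in interleaved layout: a[(i*n + j)*batch + b]
    holds element (i, j) of matrix b, so the innermost loop over b
    walks contiguous memory. The upper triangle is left untouched.
    """
    at = lambda i, j, b: (i * n + j) * batch + b
    for j in range(n):
        for b in range(batch):  # diagonal: l_jj = sqrt(a_jj - sum_k l_jk^2)
            s = a[at(j, j, b)] - sum(a[at(j, k, b)] ** 2 for k in range(j))
            a[at(j, j, b)] = math.sqrt(s)
        for i in range(j + 1, n):
            for b in range(batch):  # column update below the diagonal
                s = a[at(i, j, b)] - sum(
                    a[at(i, k, b)] * a[at(j, k, b)] for k in range(j))
                a[at(i, j, b)] = s / a[at(j, j, b)]
    return a

def batched_forward_subst(a, rhs, n, batch):
    """Solve L y = rhs in place for every matrix in the batch.

    `rhs` uses the matching interleaved vector layout rhs[i*batch + b].
    """
    at = lambda i, j, b: (i * n + j) * batch + b
    v = lambda i, b: i * batch + b
    for i in range(n):
        for b in range(batch):
            s = rhs[v(i, b)] - sum(
                a[at(i, k, b)] * rhs[v(k, b)] for k in range(i))
            rhs[v(i, b)] = s / a[at(i, i, b)]
    return rhs
```

On a GPU, the loop over `b` is what the thread index would map to; the sketch only illustrates the indexing, not the tuning parameters (block sizes, unrolling, register usage) that the autotuning explores.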
