High-performance Cholesky factorization for GPU-only execution

We present our performance analysis, algorithm designs, and the optimizations needed to develop high-performance GPU-only algorithms, in particular for the dense Cholesky factorization. In contrast to currently promoted designs that address parallelism on multicore architectures by representing algorithms as Directed Acyclic Graphs (DAGs), where nodes are fine-granularity tasks and edges are the dependencies between them, our designs explicitly target manycore architectures like GPUs and feature coarse-granularity tasks (that can be hierarchically split into fine-grained, data-parallel subtasks). Furthermore, in contrast to hybrid algorithms that schedule difficult-to-parallelize tasks on CPUs, we develop highly efficient code that executes entirely on the GPU. GPU-only codes remove the expensive CPU-to-GPU communications and the tuning challenges related to slow CPUs and/or low CPU-to-GPU bandwidth. We show that on the latest GPUs, such as the P100, this becomes so important that the GPU-only code outperforms even the hybrid MAGMA algorithms when the CPU tasks and communications cannot be entirely overlapped with GPU computations. We achieve up to 4,300 GFlop/s in double precision on a P100 GPU, which is about 7-8x faster than high-end multicore CPUs, e.g., two 10-core Intel Xeon E5-2650 v3 (Haswell) CPUs, where MKL reaches about 500-600 GFlop/s. The new algorithm also significantly outperforms the GPU-only implementation currently available in the NVIDIA cuSOLVER library.
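
To make the GPU-only, coarse-grained blocking concrete, below is a minimal sketch of a right-looking blocked Cholesky factorization (lower triangular, double precision) that keeps the matrix and all updates resident on the GPU, using standard cuSOLVER and cuBLAS calls. The function name gpu_only_dpotrf, the caller-supplied block size nb, and the minimal error handling are illustrative assumptions; this is not the paper's tuned MAGMA implementation, only an outline of the technique under those assumptions.

// Minimal sketch: right-looking blocked Cholesky kept entirely on the GPU.
// Each step factors a diagonal block with cuSOLVER (potrf), then updates the
// panel (TRSM) and the trailing matrix (SYRK) with cuBLAS, so no data moves
// back to the CPU between steps. Block size nb and names are assumptions.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusolverDn.h>

// Factor the n x n SPD matrix dA (device memory, column-major, leading
// dimension lda) in place into L * L^T. Returns the last potrf info value.
int gpu_only_dpotrf(cusolverDnHandle_t solver, cublasHandle_t blas,
                    int n, double *dA, int lda, int nb /* e.g., 256 */)
{
    // Workspace sized for the largest diagonal-block factorization.
    int lwork = 0, *d_info;
    cusolverDnDpotrf_bufferSize(solver, CUBLAS_FILL_MODE_LOWER, nb, dA, lda, &lwork);
    double *d_work;
    cudaMalloc((void **)&d_work, sizeof(double) * lwork);
    cudaMalloc((void **)&d_info, sizeof(int));

    const double one = 1.0, minus_one = -1.0;

    for (int j = 0; j < n; j += nb) {
        int jb = (n - j < nb) ? (n - j) : nb;      // current block size
        double *Ajj = dA + j + (size_t)j * lda;    // diagonal block A(j,j)

        // 1. Factor the diagonal block: A(j,j) = L(j,j) * L(j,j)^T.
        cusolverDnDpotrf(solver, CUBLAS_FILL_MODE_LOWER, jb,
                         Ajj, lda, d_work, lwork, d_info);

        if (j + jb < n) {
            int m = n - j - jb;                                   // rows below
            double *Apj = dA + (j + jb) + (size_t)j * lda;        // panel A(j+1:, j)
            double *Att = dA + (j + jb) + (size_t)(j + jb) * lda; // trailing block

            // 2. Panel update: A(j+1:, j) = A(j+1:, j) * L(j,j)^{-T}.
            cublasDtrsm(blas, CUBLAS_SIDE_RIGHT, CUBLAS_FILL_MODE_LOWER,
                        CUBLAS_OP_T, CUBLAS_DIAG_NON_UNIT,
                        m, jb, &one, Ajj, lda, Apj, lda);

            // 3. Trailing update: A(j+1:, j+1:) -= A(j+1:, j) * A(j+1:, j)^T.
            cublasDsyrk(blas, CUBLAS_FILL_MODE_LOWER, CUBLAS_OP_N,
                        m, jb, &minus_one, Apj, lda, &one, Att, lda);
        }
    }

    int info = 0;
    cudaMemcpy(&info, d_info, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_work);
    cudaFree(d_info);
    return info;
}

In this sketch every operation consumes and produces device-resident data, which is the property the paper exploits: once the CPU plays no role in the critical path, there are no CPU-to-GPU transfers to overlap or tune for.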
