Performance Portability of a GPU Enabled Factorization with the DAGuE Framework

Performance portability is a major challenge for developers targeting heterogeneous high-performance computers, which combine an interconnect, memory with non-uniform access, many-core processors, and accelerators such as GPUs. Recent studies have demonstrated that runtime systems using a DAG representation can handle dense linear algebra operations efficiently. In this work, we present the GPU subsystem of the DAGuE runtime and assess, using the Cholesky factorization as a test case, the minimal effort required of a programmer to enable GPU acceleration in the DAGuE framework. The performance achieved by this unchanged code on a variety of heterogeneous and distributed many-core and GPU resources demonstrates the desired performance portability.
