MAGMA templates for scalable linear algebra on emerging architectures

As computing resources built on accelerators and wide vector units are acquired and widely deployed, demand has grown for science and engineering applications to take advantage of them. Doing so has been extremely challenging because of the diversity of systems to support and their extreme concurrency, complex memory hierarchies, costly data movement, and heterogeneous node architectures. To address these challenges, we design a programming model and describe its ease of use in developing a new MAGMA Templates library that delivers high-performance, scalable linear algebra that is portable across current and emerging architectures. MAGMA Templates derives its performance and portability from (1) building on existing state-of-the-art linear algebra libraries, such as MAGMA, SLATE, Trilinos, and vendor-optimized math libraries, and (2) giving users seamless access to the latest algorithms and architecture-specific optimizations through a single, easy-to-use C++-based API.
