Task-based Parallel Programming for Scalable Matrix Product Algorithms

Task-based programming models have succeeded in gaining the interest of the high-performance mathematical software community because they relieve part of the burden of developing and implementing distributed-memory parallel algorithms in an efficient and portable way. On increasingly large and heterogeneous clusters of computers, these models appear as a way to maintain and enhance more complex algorithms. However, task-based programming models lack the flexibility and the features necessary to express, in an elegant and compact way, scalable algorithms that rely on advanced communication patterns. We show that the Sequential Task Flow (STF) paradigm can be extended to write compact yet efficient and scalable routines for linear algebra computations. Although this work focuses on dense General Matrix Multiplication (GEMM), the proposed features enable the implementation of more complex algorithms. We describe the implementation of these features and of the resulting GEMM operation. Finally, we present an experimental analysis on two homogeneous supercomputers showing that our approach is competitive with state-of-the-art libraries up to 32,768 CPU cores and may outperform them for some problem dimensions. Although our code can use GPUs straightforwardly, we do not address this case because it raises additional issues that are beyond the scope of this work.
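As a concrete illustration of the STF style this work builds on, the sketch below expresses a tiled GEMM through StarPU's task-insertion API, a representative STF runtime. This is a minimal sketch under stated assumptions, not the implementation evaluated here: the tile count NT, tile size TS, and the gemm_tile_cpu kernel are illustrative. The key property of STF is that the algorithm is written as a plain sequential loop nest; the runtime derives the task graph, and hence all parallelism, from the declared access modes (STARPU_R, STARPU_RW) of each submitted task.

/* Illustrative STF-style tiled GEMM with StarPU (shared memory).
 * NT, TS, and the kernel are assumptions for the sketch. */
#include <starpu.h>
#include <cblas.h>
#include <stdint.h>

#define NT 4    /* tiles per matrix dimension (assumption) */
#define TS 256  /* tile size in elements (assumption)      */

/* CPU kernel for one tile update: C_ij += A_ik * B_kj. */
static void gemm_tile_cpu(void *buffers[], void *cl_arg)
{
    (void)cl_arg;
    double *a = (double *)STARPU_MATRIX_GET_PTR(buffers[0]);
    double *b = (double *)STARPU_MATRIX_GET_PTR(buffers[1]);
    double *c = (double *)STARPU_MATRIX_GET_PTR(buffers[2]);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                TS, TS, TS, 1.0, a, TS, b, TS, 1.0, c, TS);
}

static struct starpu_codelet gemm_cl = {
    .cpu_funcs = { gemm_tile_cpu },
    .nbuffers  = 3,
    .modes     = { STARPU_R, STARPU_R, STARPU_RW },
};

/* Tile storage (zero-initialized statics) and one handle per tile. */
static double Ab[NT][NT][TS * TS], Bb[NT][NT][TS * TS], Cb[NT][NT][TS * TS];
static starpu_data_handle_t A[NT][NT], B[NT][NT], C[NT][NT];

int main(void)
{
    if (starpu_init(NULL) != 0)
        return 1;

    /* Register each tile so the runtime can track accesses to it. */
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++) {
            starpu_matrix_data_register(&A[i][j], STARPU_MAIN_RAM,
                (uintptr_t)Ab[i][j], TS, TS, TS, sizeof(double));
            starpu_matrix_data_register(&B[i][j], STARPU_MAIN_RAM,
                (uintptr_t)Bb[i][j], TS, TS, TS, sizeof(double));
            starpu_matrix_data_register(&C[i][j], STARPU_MAIN_RAM,
                (uintptr_t)Cb[i][j], TS, TS, TS, sizeof(double));
        }

    /* The sequential triple loop is the whole algorithm: the runtime
     * infers inter-task dependencies from the R/RW access modes and
     * executes independent tile updates in parallel. */
    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++)
            for (int k = 0; k < NT; k++)
                starpu_task_insert(&gemm_cl,
                                   STARPU_R,  A[i][k],
                                   STARPU_R,  B[k][j],
                                   STARPU_RW, C[i][j],
                                   0);

    starpu_task_wait_for_all();  /* wait for the entire task graph */

    for (int i = 0; i < NT; i++)
        for (int j = 0; j < NT; j++) {
            starpu_data_unregister(A[i][j]);
            starpu_data_unregister(B[i][j]);
            starpu_data_unregister(C[i][j]);
        }
    starpu_shutdown();
    return 0;
}

In a distributed-memory setting, essentially the same submission loop can drive execution across nodes once tiles are mapped to processes (e.g., with StarPU-MPI's starpu_mpi_task_insert); scaling that pattern is where the advanced communication features discussed in this work come into play.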
