Evaluation of Programming Models to Address Load Imbalance on Distributed Multi-Core CPUs: A Case Study with Block Low-Rank Factorization

To minimize data movement, many parallel applications statically distribute computational tasks among the processes. However, modern simulations often encounter irregular computational tasks whose loads change dynamically at runtime or are data dependent. As a result, load imbalance among the processes at each step of the simulation is a natural situation that must be dealt with at the programming level. The de facto parallel programming approach, flat MPI (one process per core), is ill-suited to managing this imbalance, imposing significant idle time on the simulation as processes wait for the slowest process at each step.

One critical computation in many domains is the LU factorization of a large dense matrix stored in the Block Low-Rank (BLR) format. The low-rank format can significantly reduce the cost of factorization in many scientific applications, including the boundary element analysis of electrostatic fields. However, partitioning the matrix based on the underlying geometry produces matrix blocks of different sizes, whose numerical ranks change at each step of the factorization, leading to load imbalance among the processes at each step.

We use BLR LU factorization as a test case to study the programmability and performance of five different programming approaches: (1) flat MPI, (2) Adaptive MPI (Charm++), (3) MPI + OpenMP, (4) parameterized task graph (PTG), and (5) dynamic task discovery (DTD). The last two versions use a task-based paradigm to express the algorithm; we rely on the PaRSEC runtime system to execute the tasks. We first point out the programming features needed to efficiently solve this category of problems, hinting at possible alternatives to the MPI+X programming paradigm. We then evaluate the programmability of the different approaches, detailing our experience implementing the algorithm using each of the models.
Finally, we show performance results on the Intel Haswell-based Bridges system at the Pittsburgh Supercomputing Center (PSC) and analyze how effectively each implementation addresses the load imbalance.
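To illustrate why geometric partitioning yields blocks of varying numerical rank (and hence tasks of varying cost), the following is a minimal NumPy sketch, not code from this work. The 1-D point distribution, the kernel 1/(|x - y| + shift), the uneven block boundaries, and the tolerance are all illustrative assumptions; real BLR solvers operate on BEM matrices from 3-D geometry.

```python
import numpy as np

def block_rank(block, tol=1e-8):
    """Numerical rank of a block: singular values above tol relative to the largest."""
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Hypothetical 1-D geometry: sorted random points split into uneven clusters,
# mimicking a geometry-based partition of a BEM-like kernel matrix.
rng = np.random.default_rng(0)
pts = np.sort(rng.uniform(0.0, 1.0, 256))
splits = [0, 40, 96, 160, 256]  # uneven block boundaries (illustrative)

# Smooth kernel K[i,j] = 1 / (|x_i - x_j| + shift): well-separated point
# clusters produce blocks of low numerical rank; nearby clusters do not.
K = 1.0 / (np.abs(pts[:, None] - pts[None, :]) + 1e-2)

ranks = {}
for bi in range(4):
    for bj in range(4):
        blk = K[splits[bi]:splits[bi + 1], splits[bj]:splits[bj + 1]]
        ranks[(bi, bj)] = block_rank(blk)

# Near-diagonal blocks carry much higher rank than far-field blocks, so the
# cost of each low-rank update (roughly proportional to rank) varies per task.
print(ranks)
```

Because the work per block update scales with the block's rank, a static owner-computes distribution of these blocks leaves some processes with far more work than others, which is the imbalance the task-based approaches aim to absorb.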
