Impact study of data locality on task-based applications through the Heteroprio scheduler

The task-based approach has emerged as a viable way to effectively use modern heterogeneous computing nodes. It allows the development of parallel applications with an abstraction of the hardware by delegating task distribution and load balancing to a dynamic scheduler. In this organization, the scheduler is the most critical component: it solves the DAG scheduling problem in order to select the right processing unit for each task. In this work, we extend our Heteroprio scheduler, which was originally created to execute the fast multipole method on multi-GPU nodes. We improve Heteroprio by taking data locality into account during task distribution. The main principle is to use different task lists for the different memory nodes and to investigate how the locality affinity between tasks and memory nodes can be evaluated without looking at the tasks' dependencies. We evaluate the benefit of our method on two linear algebra applications and a stencil code. We show that simple heuristics can provide significant performance improvements and cut the total memory transfers of an execution by more than half.
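To make the per-memory-node task lists concrete, the sketch below illustrates one possible byte-counting affinity heuristic: a task is pushed onto the list of the memory node that already holds the largest share of the data it accesses, without consulting the task graph. The types (DataHandle, Task) and the selection rule are illustrative assumptions for this sketch, not the paper's exact implementation or the StarPU/Heteroprio API.

```cpp
// Minimal sketch of a locality-affinity heuristic (hypothetical types,
// not the StarPU/Heteroprio API): choose the memory node whose resident
// data covers the largest share of a task's data, and push the task
// onto that node's list.
#include <cstddef>
#include <vector>

struct DataHandle {
    std::size_t sizeInBytes;
    int residentNode;   // memory node currently holding a valid copy (-1 if none)
};

struct Task {
    std::vector<const DataHandle*> data;  // pieces of data the task accesses
};

// Affinity of a task for a memory node: fraction of the task's bytes
// already resident on that node. No dependency information is needed.
double localityAffinity(const Task& task, int node) {
    std::size_t residentBytes = 0, totalBytes = 0;
    for (const DataHandle* h : task.data) {
        totalBytes += h->sizeInBytes;
        if (h->residentNode == node) residentBytes += h->sizeInBytes;
    }
    return totalBytes ? double(residentBytes) / double(totalBytes) : 0.0;
}

// Push the task onto the list of the memory node with the highest affinity.
void pushToBestList(const Task& task,
                    std::vector<std::vector<const Task*>>& listsPerNode) {
    int bestNode = 0;
    double bestAffinity = -1.0;
    for (int node = 0; node < static_cast<int>(listsPerNode.size()); ++node) {
        const double a = localityAffinity(task, node);
        if (a > bestAffinity) { bestAffinity = a; bestNode = node; }
    }
    listsPerNode[bestNode].push_back(&task);
}
```

A worker attached to a given memory node would then pop tasks from its own list first, so the data it needs is more likely to be resident locally and fewer transfers are triggered.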
