Flexible Data Redistribution in a Task-Based Runtime System

Data redistribution reshuffles data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing efficiency and therefore decreasing the algorithm's time-to-solution. The classical redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Recently, task-based runtime systems have gained popularity as a candidate for addressing programming complexity on the way to exascale. In addition to improving portability across complex hardware and software systems, task-based runtime systems can cope more easily with less-regular data distributions, providing a more balanced computational load over the lifetime of the execution. In this scenario, it becomes paramount to develop a general redistribution algorithm for task-based runtime systems that supports all types of regular and irregular data distributions. In this paper, we detail a flexible redistribution algorithm that handles redistribution problems without constraints on data distribution or data size, and we implement it in a task-based runtime system, PaRSEC. Performance results compare favorably with ScaLAPACK, and applications show increased efficiency with little overhead across data distributions and data sizes.
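To make the classical problem mentioned above concrete, the following minimal sketch computes which index ranges must move between ranks when a one-dimensional array is reshuffled from one block-cyclic layout to another. It illustrates the regular block-cyclic case only; it is not the flexible algorithm of this paper and does not use any PaRSEC or ScaLAPACK API. The function names `block_cyclic_owner` and `redistribution_plan` and all parameters are hypothetical, chosen for illustration.

```python
from collections import defaultdict


def block_cyclic_owner(i, block, nprocs):
    """Rank that owns global index i in a 1D block-cyclic layout.

    Blocks of `block` consecutive elements are dealt to ranks round-robin.
    """
    return (i // block) % nprocs


def redistribution_plan(n, src_block, src_procs, dst_block, dst_procs):
    """Return {(src_rank, dst_rank): [(start, stop), ...]} for n elements.

    Each (start, stop) is a half-open range of global indices owned by
    src_rank in the source layout and by dst_rank in the target layout.
    Illustrative sketch only: it enumerates transfers, it does not
    schedule them.
    """
    plan = defaultdict(list)
    i = 0
    while i < n:
        # Ownership can only change at the next source or target block boundary.
        next_src = (i // src_block + 1) * src_block
        next_dst = (i // dst_block + 1) * dst_block
        stop = min(next_src, next_dst, n)
        src = block_cyclic_owner(i, src_block, src_procs)
        dst = block_cyclic_owner(i, dst_block, dst_procs)
        plan[(src, dst)].append((i, stop))
        i = stop
    return dict(plan)


def remote_volume(plan):
    """Number of elements that must cross rank boundaries."""
    return sum(stop - start
               for (src, dst), ranges in plan.items() if src != dst
               for start, stop in ranges)


if __name__ == "__main__":
    # 12 elements, block size 2 on 2 ranks -> block size 3 on 2 ranks.
    plan = redistribution_plan(12, 2, 2, 3, 2)
    for pair in sorted(plan):
        print(pair, plan[pair])
    print("elements sent between ranks:", remote_volume(plan))
```

In this small configuration, rank 0 keeps the ranges (0, 2) and (8, 9) locally, while the remaining ranges it owns go to rank 1. An irregular distribution would replace `block_cyclic_owner` with an arbitrary ownership map, which is the more general setting the abstract refers to.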
