Evaluating Data Redistribution in PaRSEC

Data redistribution reshuffles data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or reducing communication volume or cost, with the ultimate goal of increasing efficiency and thereby reducing the algorithm's time-to-solution. The classic redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Besides the distribution, data size is also a performance-critical parameter because it affects the reshuffling algorithm's cache behavior, communication efficiency, and potential parallelism. Meanwhile, task-based runtime systems have recently gained popularity as a candidate for addressing programming complexity on the way to exascale. In this setting, it becomes paramount to develop a flexible redistribution algorithm for task-based runtime systems that supports all types of regular and irregular data distributions and takes data size into account. In this article, we detail such a flexible redistribution algorithm and implement it efficiently in a task-based runtime system, PaRSEC. Performance results compare well against both the theoretical bound and ScaLAPACK, and applications show increased efficiency at little overhead in terms of data distribution, data size, and data format.
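
To make the classic problem concrete, the following is a minimal sketch of a one-dimensional block-cyclic layout and of the data volume that must move when switching between two such layouts. It is an illustration only, not the algorithm or the PaRSEC implementation described in the article; the function names (`block_cyclic_owner`, `redistribution_volume`, `remote_elements`) and the `(block_size, num_procs)` tuple convention are assumptions introduced for this example.

```python
from collections import Counter


def block_cyclic_owner(index, block_size, num_procs):
    """Rank owning global element `index` in a 1D block-cyclic layout."""
    return (index // block_size) % num_procs


def redistribution_volume(n, src, dst):
    """Count the elements each (source rank, target rank) pair exchanges.

    `src` and `dst` are hypothetical (block_size, num_procs) tuples
    describing the source and target block-cyclic layouts of n elements.
    """
    volume = Counter()
    for i in range(n):
        s = block_cyclic_owner(i, *src)
        t = block_cyclic_owner(i, *dst)
        volume[(s, t)] += 1
    return volume


def remote_elements(volume):
    """Number of elements whose owner changes, i.e. that must be sent."""
    return sum(count for (s, t), count in volume.items() if s != t)


if __name__ == "__main__":
    # 8 elements on 2 ranks: block size 2 redistributed to block size 4.
    vol = redistribution_volume(8, src=(2, 2), dst=(4, 2))
    print(dict(vol))
    print(remote_elements(vol))
```

The per-pair counts form the communication matrix that a redistribution scheduler would then order into messages; an irregular distribution would replace `block_cyclic_owner` with an arbitrary ownership function.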