Hierarchical DAG Scheduling for Hybrid Distributed Systems

Accelerator-enhanced computing platforms have drawn considerable attention due to their massive peak computational capacity. Despite significant advances in the programming interfaces to such hybrid architectures, traditional programming paradigms struggle to map the resulting multi-dimensional heterogeneity and to express algorithmic parallelism, resulting in sub-optimal effective performance. Task-based programming paradigms can alleviate some of the programming challenges on distributed hybrid many-core architectures. In this paper we take this concept a step further by showing that the potential of task-based programming paradigms can be greatly increased with minimal modifications to the underlying runtime combined with the right algorithmic changes. We propose two novel recursive algorithmic variants for one-sided factorizations and describe the changes to the PaRSEC task-scheduling runtime that build a framework in which task granularity is dynamically adjusted, trading off the degree of available parallelism against kernel efficiency according to runtime conditions. Based on an extensive set of results, we show that, for one-sided factorizations, namely Cholesky and QR, a carefully written algorithm, supported by an adaptive task-based runtime, is capable of reaching levels of performance and scalability not previously achieved in distributed hybrid environments.
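To make the granularity trade-off concrete, the sketch below (in Python with NumPy/SciPy, not the paper's PaRSEC/C implementation) shows a right-looking blocked Cholesky factorization in which the panel width is chosen step by step by a hypothetical runtime policy: wide panels keep individual kernels efficient, narrow panels expose more concurrent tasks. The pick_block_size policy, its thresholds, and the idle_workers probe are illustrative assumptions, not part of the paper.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def pick_block_size(idle_workers):
    # Hypothetical granularity policy: with few idle workers, keep tasks coarse
    # so each kernel call remains efficient; with many idle workers, split finer
    # to expose more parallelism. The thresholds are illustrative only.
    return 256 if idle_workers < 4 else 64

def adaptive_blocked_cholesky(A, idle_workers=lambda: 0):
    """Return lower-triangular L with L @ L.T == A, choosing the panel width
    (task granularity) per step instead of fixing it up front."""
    A = np.array(A, dtype=float, copy=True)
    n, k = A.shape[0], 0
    while k < n:
        nb = min(pick_block_size(idle_workers()), n - k)
        # POTRF-like step: factor the diagonal block.
        A[k:k+nb, k:k+nb] = cholesky(A[k:k+nb, k:k+nb], lower=True)
        if k + nb < n:
            L11 = A[k:k+nb, k:k+nb]
            # TRSM-like step: compute the panel below the diagonal block.
            A[k+nb:, k:k+nb] = solve_triangular(L11, A[k+nb:, k:k+nb].T,
                                                lower=True).T
            # SYRK/GEMM-like step: update the trailing submatrix.
            L21 = A[k+nb:, k:k+nb]
            A[k+nb:, k+nb:] -= L21 @ L21.T
        k += nb
    return np.tril(A)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((500, 500))
    A = M @ M.T + 500 * np.eye(500)   # symmetric positive definite test matrix
    L = adaptive_blocked_cholesky(A, idle_workers=lambda: rng.integers(0, 8))
    print(np.allclose(L @ L.T, A))    # True
```

In the paper's setting, the analogue of pick_block_size is the runtime deciding whether to keep a coarse task for an accelerator or split it recursively into finer tasks; the recursion is flattened into a loop here for brevity.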
