Weighted dynamic scheduling with many parallelism grains for offloading of numerical workloads to multiple varied accelerators

A wide variety of heterogeneous compute resources are available in modern computers, including multiple sockets of multicore CPUs, one or more GPUs of varying power, and coprocessors such as the Intel Xeon Phi. The challenge facing domain scientists is how to use these varied resources efficiently and productively. For example, to use GPUs effectively, a workload must expose a greater degree of parallelism than one designed for a multicore CPU. A domain scientist would therefore have to design and schedule an application at multiple degrees of parallelism and task grain sizes to obtain efficient performance from all the resources. We propose a productive programming model that starts from serial code and achieves parallelism and scalability by using a task-superscalar runtime environment to adapt the computation to the available resources. The adaptation occurs at multiple points, including multi-level data partitioning, adaptive task grain sizes, and dynamic task scheduling. We demonstrate the effectiveness of this approach for utilizing multi-way heterogeneous hardware resources by implementing dense linear algebra applications.
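To make the task-superscalar idea concrete, the following is a minimal sketch (not the authors' actual runtime) of how such a system infers dependencies from the READ/WRITE access modes that tasks declare on data tiles, in the style of runtimes like QUARK or StarPU. All names here (`TaskRuntime`, `insert_task`, the tile labels) are hypothetical and chosen for illustration only.

```python
from collections import defaultdict

class TaskRuntime:
    """Toy dependency-aware (task-superscalar) runtime.

    Tasks declare READ/WRITE access to named data tiles; the runtime
    infers task-to-task dependencies from those declarations, so serial
    task-insertion order yields a correct parallel-ready DAG.
    """
    READ, WRITE = "R", "W"

    def __init__(self):
        self.tasks = []                    # tid -> (func, accesses)
        self.last_writer = {}              # tile -> tid of last writer
        self.readers = defaultdict(list)   # tile -> tids reading since last write
        self.deps = defaultdict(set)       # tid -> set of prerequisite tids

    def insert_task(self, func, accesses):
        """Register a task; accesses is a list of (tile, mode) pairs."""
        tid = len(self.tasks)
        self.tasks.append((func, accesses))
        for tile, mode in accesses:
            if mode == self.READ:
                # Read-after-write: wait for the tile's last writer.
                if tile in self.last_writer:
                    self.deps[tid].add(self.last_writer[tile])
                self.readers[tile].append(tid)
            else:
                # Write-after-write and write-after-read hazards.
                if tile in self.last_writer:
                    self.deps[tid].add(self.last_writer[tile])
                self.deps[tid].update(self.readers[tile])
                self.readers[tile] = []
                self.last_writer[tile] = tid
        self.deps[tid].discard(tid)
        return tid

    def run(self):
        """Execute tasks in dependency order; return the execution order."""
        done, order = set(), []
        pending = set(range(len(self.tasks)))
        while pending:
            ready = sorted(t for t in pending if self.deps[t] <= done)
            for t in ready:
                self.tasks[t][0]()
                done.add(t)
                order.append(t)
            pending -= set(ready)
        return order
```

A tiled factorization would then be expressed as serial task insertions, e.g. a kernel that writes tile `A`, one that reads `A` and writes `B`, and one that reads `B`; the runtime discovers the resulting chain automatically. A real heterogeneous runtime would additionally pick the task grain size and the target device (CPU core vs. GPU) per task, which this sketch omits.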
