Heterogeneous Task Scheduling for Accelerated OpenMP

Heterogeneous systems with CPUs and computational accelerators such as GPUs, FPGAs, or the upcoming Intel MIC are becoming mainstream. In these systems, peak performance includes the performance of not just the CPUs but also of all available accelerators. Despite this, most programming models for heterogeneous computing focus on only one of these resources. With the development of Accelerated OpenMP for GPUs, both from PGI and Cray, we have a clear path to extend traditional OpenMP applications incrementally to use GPUs. However, these extensions are geared toward switching from CPU parallelism to GPU parallelism rather than preserving the former while adding the latter, so computational potential is wasted because either the CPU cores or the GPU cores are left idle. Our goal is a runtime system that intelligently and automatically divides an accelerated OpenMP region across all available resources. This paper presents our proof-of-concept runtime system for dynamic task scheduling across CPUs and GPUs. Further, we motivate the addition of this system to the proposed OpenMP for Accelerators standard. Finally, we show that this approach can yield as much as a two-fold performance improvement over using either the CPU or the GPU alone.
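
The following is a minimal illustrative sketch, not the paper's runtime: it shows the kind of manual CPU/GPU work splitting that the proposed scheduler would perform automatically, written with modern OpenMP target directives rather than the PGI/Cray accelerator syntax discussed above. The function name, array names, and the fixed split fraction are hypothetical.

```c
/* Sketch only: statically splitting one data-parallel loop between the GPU
 * and the host CPU cores. The paper's runtime would choose and adapt this
 * split automatically instead of using a hard-coded fraction. */
#include <stddef.h>

void saxpy_split(size_t n, float a, const float *x, float *y,
                 double gpu_fraction)          /* e.g. 0.7 => 70% on the GPU */
{
    size_t split = (size_t)(n * gpu_fraction);

    /* GPU portion: offloaded asynchronously (nowait) so the CPU portion
     * below can execute concurrently on the host cores. */
    #pragma omp target teams distribute parallel for \
            map(to: x[0:split]) map(tofrom: y[0:split]) nowait
    for (size_t i = 0; i < split; ++i)
        y[i] = a * x[i] + y[i];

    /* CPU portion: the remaining iterations on the host. */
    #pragma omp parallel for
    for (size_t i = split; i < n; ++i)
        y[i] = a * x[i] + y[i];

    /* Wait for the asynchronous GPU target task to complete. */
    #pragma omp taskwait
}
```

In this hand-tuned form the split ratio must be chosen per application and per machine; a dynamic scheduler like the one proposed here removes that burden by dividing the iteration space across CPUs and GPUs at run time.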
