HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems

Heterogeneous computing systems, e.g., those that pair accelerators with host CPUs, offer high performance for a variety of workloads. However, most parallel programming models require platform-dependent, time-consuming hand-tuning to use all of a system's resources collectively and efficiently. In this work, we explore OpenMP language extensions that empower users to design applications that automatically and simultaneously leverage CPUs and accelerators, further optimizing the use of available resources. We believe such automation will be key to ensuring that codes adapt to the growing number and diversity of accelerator resources in future computing systems. The proposed system combines language extensions to OpenMP, load-balancing algorithms and heuristics, and a runtime system that distributes loop iterations and data across heterogeneous processing elements. We demonstrate the effectiveness of our automated approach on systems with multiple CPUs, GPUs, and MICs.
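To make the idea concrete, the sketch below shows the kind of data-parallel loop HOMP targets. It uses only standard OpenMP 4.x offloading, which binds the whole loop to a single device; the comment marks where a HOMP-style distribution clause would take over. The dist_iteration(AUTO) spelling in the comment is illustrative only, not the paper's exact syntax.

    /* Minimal sketch: a SAXPY loop offloaded with standard OpenMP 4.x.
     * Standard OpenMP runs the entire loop on one target device.
     * A HOMP-style extension might add a clause such as
     *     dist_iteration(AUTO)
     * so the runtime splits iterations (and the mapped arrays) across
     * all available CPUs, GPUs, and MICs; that clause is hypothetical
     * here, shown only to indicate where the extension would apply. */
    #include <stdio.h>
    #include <stdlib.h>

    #define N (1 << 20)

    int main(void) {
        float *x = malloc(N * sizeof *x);
        float *y = malloc(N * sizeof *y);
        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        float a = 3.0f;

        /* Baseline: offload to a single device (ignored and run
         * serially if the compiler has no offload support). */
        #pragma omp target teams distribute parallel for \
                map(to: x[0:N]) map(tofrom: y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        free(x);
        free(y);
        return 0;
    }

Under the automated scheme the abstract describes, the load-balancing heuristics and runtime, rather than the programmer, would decide how many iterations and which array sections each processing element receives.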
