IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism

We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC while exploiting the shared-memory parallelism available in the target system. It dynamically adapts input MPI+OpenACC applications to the target heterogeneous accelerator cluster to fully exploit system-specific features. IMPACC provides programmers with a unified virtual address space, automatic NUMA-friendly task-to-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance on three heterogeneous accelerator systems, including the Titan supercomputer. The results show that IMPACC achieves easier programming, higher performance, and better scalability than the current MPI+OpenACC model.
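To make the baseline programming model concrete, the sketch below shows a minimal hybrid MPI+OpenACC program of the kind IMPACC targets: each MPI rank offloads a loop to its accelerator with an OpenACC directive, then exchanges data with a neighboring rank through an ordinary MPI call. This is our own illustrative example of the conventional MPI+OpenACC model, not code from the paper; the explicit copy clause that stages device data back to the host before communication is exactly the kind of overhead an integrated runtime such as IMPACC is designed to remove.

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define N 1024

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double *a = (double *)malloc(N * sizeof(double));
        for (int i = 0; i < N; i++)
            a[i] = (double)rank;

        /* Offload the compute kernel to this rank's accelerator. The copy
           clause stages the array to the device and back to the host,
           because a plain MPI library cannot see device memory. */
        #pragma acc parallel loop copy(a[0:N])
        for (int i = 0; i < N; i++)
            a[i] = 2.0 * a[i] + 1.0;

        /* Exchange the now host-resident data with a neighboring rank
           using a standard MPI call. */
        int peer = rank ^ 1;
        if (peer < size)
            MPI_Sendrecv_replace(a, N, MPI_DOUBLE, peer, 0, peer, 0,
                                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("rank %d: a[0] = %f\n", rank, a[0]);
        free(a);
        MPI_Finalize();
        return 0;
    }

Compiled with an OpenACC-capable MPI toolchain (e.g., mpicc with OpenACC support enabled), each rank doubles its local vector on its device and then swaps it with an adjacent rank; under the conventional model the host staging and the MPI transfer are two separate, uncoordinated steps.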
