OpenMP Memkind: An Extension for Heterogeneous Physical Memories

Recently, CPUs and graphics processors have increased their degree of on-chip parallelism to compensate for the slowdown of traditional Moore's Law scaling. As a result, these processors demand faster memory devices with higher bandwidth. Component manufacturers have responded with disparate or hierarchical fast memory architectures such as shared local memory (SLM), scratchpad memory (SPM), and high-bandwidth memory (HBM). Following this trend, physical memory locality is becoming a performance feature that users want to manage explicitly. Motivated by this observation, we create a heterogeneous memory interface for the OpenMP parallel programming specification based on a new declarative data storage directive, "memkind", that lets programmers explicitly manage physical memory locality. Our approach is realized as an OpenMP directive so that data is not allocated inside parallel regions, avoiding the performance degradation caused by sequential operating system allocation routines. We implement our approach as an extension to the LLVM OpenMP implementation, which allows it to be rapidly ported to any LLVM-supported target architecture. Our contributions are a detailed design analysis of the memkind directive and a complete implementation in the LLVM compiler infrastructure. We demonstrate the efficacy of our approach using a synthetic benchmark application that records execution performance and memory allocation efficiency.
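
The abstract does not give the concrete syntax of the proposed directive, so the following minimal sketch is only illustrative. It uses Intel's existing memkind library (memkind_malloc, memkind_free, memkind_check_available, MEMKIND_HBW, MEMKIND_DEFAULT) to place a buffer in high-bandwidth memory before entering an OpenMP parallel region, which is the placement decision the proposed directive would express declaratively; none of these names come from the paper itself, and the build line is an assumption.

/* Illustrative sketch only: the paper's proposed directive is not shown here.
 * This uses Intel's existing memkind library to place a buffer in
 * high-bandwidth memory (falling back to DDR) before an OpenMP region.
 * Assumed build line: cc -fopenmp memkind_sketch.c -lmemkind */
#include <stdio.h>
#include <stdlib.h>
#include <memkind.h>

int main(void)
{
    const size_t n = 1u << 20;

    /* Prefer high-bandwidth memory; fall back to ordinary DDR if none exists. */
    memkind_t kind = (memkind_check_available(MEMKIND_HBW) == 0)
                         ? MEMKIND_HBW
                         : MEMKIND_DEFAULT;

    double *a = memkind_malloc(kind, n * sizeof *a);
    if (a == NULL) {
        fprintf(stderr, "allocation failed\n");
        return EXIT_FAILURE;
    }

    /* The allocation happens once, outside the parallel region, so no thread
     * pays for a sequential operating-system allocation path inside the loop. */
    #pragma omp parallel for
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * (double)i;

    printf("a[42] = %.1f\n", a[42]);
    memkind_free(kind, a);
    return EXIT_SUCCESS;
}

Allocating before the parallel region mirrors the paper's stated goal of keeping sequential operating system allocation paths out of parallel code; the proposed directive would move this placement decision from explicit allocator calls into a declarative annotation handled by the compiler and runtime.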
