Offload Compiler Runtime for the Intel® Xeon Phi Coprocessor

The Intel® Xeon Phi coprocessor platform has a new software stack that enables new programming models. One such model is the offload of computation from a host processor to a coprocessor that is a fully capable Intel® Architecture CPU, namely the Intel® Xeon Phi coprocessor. The purpose of offload is to improve response time, throughput, or both. This paper presents the compiler offload runtime infrastructure for the Intel® Xeon Phi coprocessor, which includes a production C/C++ and Fortran compiler that enables offload to that coprocessor and an underlying Intel® Many Integrated Core (Intel® MIC) platform software stack that supports offloading. The paper shares insights gained from a multi-year, intensive development effort. It addresses end users' questions about offload with the compiler offload runtime: why offload to a coprocessor is useful, how it is specified, and under what conditions it is profitable. It also serves as a guide for potential third-party developers of offload runtimes, such as a gcc-based offload compiler; ports of existing commercial offloading compilers, such as CAPS®, to the Intel® Xeon Phi coprocessor; and third-party offload library vendors that Intel is working with, such as NAG® and MAGMA®. It describes the software architecture and design of the offload compiler runtime. It enumerates the key performance features of this heterogeneous computing stack, which relate to initialization, data movement, and invocation. Finally, it evaluates the performance impact of those features on a set of directed micro-benchmarks and larger workloads.
