HAM - Heterogeneous Active Messages for Efficient Offloading on the Intel Xeon Phi

The applicability of accelerators is limited by the attainable speed-up for the offloaded computations and by the offloading overheads. While GPU programming models like CUDA and OpenCL only allow optimising the application code and its speed-up, the available low-level APIs for the Intel Xeon Phi provide an opportunity to address the overheads, too. This work presents a Heterogeneous Active Message (HAM) layer that minimises software overheads for offloading on Intel’s Xeon Phi. It provides the basis for an offload API with semantics similar to those of the Intel Language Extensions for Offload (LEO). In contrast to LEO, HAM works within the C++ language and needs no additional compiler support. We evaluated HAM on top of SCIF and MPI as communication backends. While the SCIF backend offers the best performance, the MPI backend allows for inter-node offloads, which are not possible with other offload solutions. Benchmark results show that the cost of offloading a function call can be decreased by a factor of up to 18 compared with LEO.
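To make the idea of an active message concrete, the following is a minimal single-process sketch in plain C++. It is not HAM's API: every name in it (`handler_fn`, `scaled_sum_args`, `make_message`, `receive_and_execute`) is hypothetical, and the "transfer" is simply a byte buffer handed from one function to another where a real backend such as SCIF or MPI would move it to the coprocessor. The sketch assumes that sender and receiver are built from the same source, so an index into a shared handler table identifies the function to run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// All names below are hypothetical illustrations, not HAM's actual interface.

// A handler decodes its arguments from the payload, runs the target
// function, and writes the result into the reply buffer.
using handler_fn = void (*)(const unsigned char* payload, unsigned char* reply);

// The function we want to "offload".
double scaled_sum(double a, double b, double factor) { return (a + b) * factor; }

// Flat, trivially copyable argument record for scaled_sum.
struct scaled_sum_args { double a, b, factor; };

void scaled_sum_handler(const unsigned char* payload, unsigned char* reply) {
    scaled_sum_args args;
    std::memcpy(&args, payload, sizeof args);
    const double result = scaled_sum(args.a, args.b, args.factor);
    std::memcpy(reply, &result, sizeof result);
}

// Handler table shared by both sides. The message carries an index into
// this table rather than a raw function address (assumption: addresses
// need not match between differently compiled binaries).
const handler_fn handler_table[] = { scaled_sum_handler };
const std::size_t handler_count = sizeof handler_table / sizeof handler_table[0];

// Sender side: pack the handler index followed by the arguments.
std::vector<unsigned char> make_message(std::size_t handler_index,
                                        const scaled_sum_args& args) {
    std::vector<unsigned char> msg(sizeof handler_index + sizeof args);
    std::memcpy(msg.data(), &handler_index, sizeof handler_index);
    std::memcpy(msg.data() + sizeof handler_index, &args, sizeof args);
    return msg;
}

// Receiver side: read the index, look up the handler, and execute it
// on the payload that follows the index.
double receive_and_execute(const std::vector<unsigned char>& msg) {
    std::size_t handler_index;
    std::memcpy(&handler_index, msg.data(), sizeof handler_index);
    assert(handler_index < handler_count);
    double result = 0.0;
    handler_table[handler_index](msg.data() + sizeof handler_index,
                                 reinterpret_cast<unsigned char*>(&result));
    return result;
}
```

Calling `receive_and_execute(make_message(0, scaled_sum_args{1.0, 2.0, 3.0}))` runs `scaled_sum` through the handler table. Because the whole mechanism is ordinary C++, a scheme of this kind needs no compiler extensions, which is the property the abstract contrasts with LEO.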
