A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters

Standard offload programming models for the Xeon Phi, such as Intel LEO and OpenMP 4.0, are restricted to a single compute node and hence to a limited number of coprocessors. Scaling applications across a Xeon Phi cluster or supercomputer therefore requires hybrid programming approaches, usually MPI+X. In this work, we present HAM-Offload, a framework based on heterogeneous active messages that offloads work to local and remote (co)processors through a unified offload API. Because HAM-Offload provides primitives similar to those of current local offload frameworks, existing applications can be ported easily, overcoming the single-node limitation while keeping the convenient offload programming model. We demonstrate the effectiveness of the framework by using it to enable a real-world molecular dynamics application to use multiple local and remote Xeon Phis. The evaluation shows good scaling behavior. Compared with LEO, performance is equal for large offloads and significantly better for small offloads.
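
The idea of a unified offload API can be illustrated with a minimal conceptual sketch. It is not the HAM-Offload API: every name below (`Target`, `allocate`, `put`, `get`, `offload_async`, `inner_product`, `run_on`) is hypothetical, Python stands in for the C++ a real framework would use, and a worker thread stands in for each (co)processor. The sketch only shows that, once work is expressed as an active message (a callable plus its arguments shipped to a target), the calling code is identical for a local and a remote target.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


class Target:
    """Hypothetical offload target; one worker thread stands in for a (co)processor."""

    def __init__(self, node_id: int, remote: bool) -> None:
        self.node_id = node_id
        self.remote = remote  # assumption: a real framework would pick the transport here
        self.buffers: Dict[int, List[float]] = {}
        self.next_handle = 0
        self._worker = ThreadPoolExecutor(max_workers=1)

    def submit(self, message: Callable[["Target"], object]) -> Future:
        # An "active message": a callable sent to the target and executed there.
        return self._worker.submit(message, self)


def allocate(target: Target, n: int) -> int:
    """Allocate n floats in the target's memory and return a buffer handle."""
    def message(t: Target) -> int:
        handle = t.next_handle
        t.next_handle += 1
        t.buffers[handle] = [0.0] * n
        return handle
    return target.submit(message).result()


def put(target: Target, handle: int, data: Sequence[float]) -> None:
    """Copy host data into a target buffer."""
    payload = list(data)
    def message(t: Target) -> None:
        t.buffers[handle][:len(payload)] = payload
    target.submit(message).result()


def get(target: Target, handle: int) -> List[float]:
    """Copy a target buffer back to the host."""
    return target.submit(lambda t: list(t.buffers[handle])).result()


def offload_async(target: Target, kernel: Callable, *args) -> Future:
    """Run kernel(target, *args) on the target; the caller gets a future."""
    return target.submit(lambda t: kernel(t, *args))


def inner_product(t: Target, a: int, b: int) -> float:
    """Example kernel operating on buffers that live on the target."""
    return sum(x * y for x, y in zip(t.buffers[a], t.buffers[b]))


def run_on(target: Target, xs: Sequence[float], ys: Sequence[float]) -> float:
    a, b = allocate(target, len(xs)), allocate(target, len(ys))
    put(target, a, xs)
    put(target, b, ys)
    return offload_async(target, inner_product, a, b).result()


local_phi = Target(node_id=1, remote=False)
remote_phi = Target(node_id=2, remote=True)
# The same host code drives both targets.
results = [run_on(t, [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]) for t in (local_phi, remote_phi)]
```

In this sketch the only difference between the two targets is the `remote` flag, which is where a real implementation would choose between an intra-node and an inter-node communication path.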
