MIC-RO: enabling efficient remote offload on heterogeneous many integrated core (MIC) clusters with InfiniBand

Xeon Phi, the latest Many Integrated Core (MIC) co-processor from Intel, packs up to 1 TFLOP of double precision performance in a single chip while providing x86 compatibility and supporting popular programming models like MPI and OpenMP. One of the easiest ways to take advantage of the MIC is to use compiler directives to offload appropriate compute tasks of an application. However, because the Xeon Phi is an expensive resource, production systems are expected to be heterogeneous, with only a subset of compute nodes equipped with the MIC co-processor. Moreover, not all applications will be able to take advantage of the complete compute power offered by a Xeon Phi. In such scenarios, the existing state-of-the-art frameworks, which require applications to be scheduled on compute nodes that have the MIC co-processor, lead to inefficient utilization of the computing power offered by the MIC. To address this limitation, it is critical to design an efficient framework that allows applications to offload compute tasks to remote MICs. In this paper, we take on this challenge and design MIC-RO, a novel framework to enable efficient remote offload on heterogeneous MIC clusters. To the best of our knowledge, this is the first design that enables application scientists to offload computation to remote MICs. Our experimental results show that, using MIC-RO, applications are able to offload computation to remote MICs with no overhead compared to offloading to local MICs. Moreover, MIC-RO outperforms the default Intel compiler based offload techniques by up to a factor of two for multiple benchmarks and application kernels.