Concurrent Kernel Execution on Xeon Phi within Parallel Heterogeneous Workloads

Computations with a sufficient amount of parallelism and workload size may take advantage of many-core coprocessors. In contrast, small-scale workloads usually suffer from a poor utilization of the coprocessor resources. For parallel applications with small but many computational kernels a concurrent processing on a shared coprocessor may be a viable solution. We evaluate the Xeon Phi offload models Intel LEO and OpenMP4 within multi-threaded and multi-process host applications with concurrent coprocessor offloading. Limitations of OpenMP4 regarding data persistence across function calls, e.g. when used within libraries, can slow down the application. We propose an offload-proxy approach for OpenMP4 to recover the performance in these cases. For concurrent kernel execution, we demonstrate the performance of the different offload models and our offload-proxy by using synthetic kernels and a parallel hybrid CPU/Xeon Phi molecular simulation application.