Evaluation of the Intel Xeon Phi offload runtimes for domain decomposition solvers

Abstract: In this paper we compare several runtimes that can be used to offload computationally intensive kernels to Intel Xeon Phi coprocessors. The benchmark application is a stripped-down version of an iterative solver used within the Schur complement finite or boundary element tearing and interconnecting (FETI, BETI) domain decomposition methods, in which the sparse solve with local stiffness matrices is replaced by multiplication with dense matrices in order to exploit coalesced memory access patterns. We present offload approaches based on the Intel Language Extension for Offload (LEO), the Hetero Streams Library (hStreams), and Heterogeneous Active Messages (HAM), and compare their performance and ease of use.
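
The kernel described in the abstract, where a dense matrix multiplication stands in for the local sparse solve, might look roughly like the sketch below. It is a minimal host-only illustration, not the paper's benchmark code: the `Subdomain` struct and the `apply_dense` and `apply_all` functions are hypothetical names, the matrices are assumed to be square and stored row-major, and no offload runtime (LEO, hStreams, or HAM) is involved.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for one subdomain: a dense, row-major n x n matrix
// standing in for the local operator that replaces the sparse solve.
struct Subdomain {
    std::size_t n;          // local problem size
    std::vector<double> S;  // n * n entries, row-major
};

// Compute y = S * x for a single subdomain.
// Each row is read contiguously, which is the access pattern the dense
// formulation is meant to expose.
void apply_dense(const Subdomain& d,
                 const std::vector<double>& x,
                 std::vector<double>& y) {
    y.assign(d.n, 0.0);
    for (std::size_t i = 0; i < d.n; ++i) {
        const double* row = d.S.data() + i * d.n;
        double acc = 0.0;
        for (std::size_t j = 0; j < d.n; ++j) {
            acc += row[j] * x[j];
        }
        y[i] = acc;
    }
}

// Apply the local operators of all subdomains. The subdomain products are
// independent of each other, so in an offload setting this loop is the
// natural unit of work to ship to a coprocessor (assumption of this sketch).
void apply_all(const std::vector<Subdomain>& doms,
               const std::vector<std::vector<double>>& xs,
               std::vector<std::vector<double>>& ys) {
    ys.resize(doms.size());
    for (std::size_t k = 0; k < doms.size(); ++k) {
        apply_dense(doms[k], xs[k], ys[k]);
    }
}
```

Because the per-subdomain products do not depend on one another, the three runtimes differ mainly in how data movement and kernel launch around such a loop are expressed, rather than in the arithmetic itself.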
