Reverse Offload Programming on Heterogeneous Systems

To achieve high computational throughput, heterogeneous architectures employ many special-purpose cores as floating-point coprocessors. Popular programming models typically offload compute-intensive operations to the coprocessors and then aggregate the results. This approach requires transferring large amounts of data over the Peripheral Component Interconnect Express (PCIe) bus. To cope with the limited bandwidth of PCIe, we develop a reverse offload (rOffload) model that treats the autonomous Intel Many Integrated Core (MIC) coprocessor as the host processor and the CPU as the coprocessor. The MICs orchestrate the computation and offload to the CPUs only the work that cannot be accelerated on the MIC, thus reducing the overhead of moving data between distinct memory regions. In this paper, we present an overview of rOffload, including its basic programming interface and its implementation on a CPU-MIC system. Results from benchmarks and application experiments conducted on the Tianhe-2 supercomputer demonstrate the efficiency of the rOffload model in terms of programmability, portability, and performance.
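The data-movement argument can be illustrated with a toy cost model. The sketch below is not the rOffload interface; the `Kernel` class, the `pcie_bytes` function, the kernel names, and all sizes are hypothetical, chosen only for illustration. It assumes that the data lives in the memory of whichever processor orchestrates the computation, and that a kernel executed on the other processor ships its whole working set across PCIe and back, with no reuse between kernels.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    data_bytes: int         # working-set size the kernel touches (made-up value)
    accelerable: bool       # hypothetical flag: does the kernel run well on the MIC?


def pcie_bytes(kernels, home):
    """Bytes crossing PCIe when the data resides on `home` ("cpu" or "mic").

    Assumption: a kernel that executes away from `home` copies its full
    working set to the other processor and copies it back afterwards.
    """
    total = 0
    for k in kernels:
        runs_on = "mic" if k.accelerable else "cpu"
        if runs_on != home:
            total += 2 * k.data_bytes
    return total


# Hypothetical workload: two large accelerable kernels, one small serial phase.
workload = [
    Kernel("stencil", 800 * 10**6, accelerable=True),
    Kernel("dgemm", 1200 * 10**6, accelerable=True),
    Kernel("setup_io", 50 * 10**6, accelerable=False),
]

conventional = pcie_bytes(workload, home="cpu")  # CPU orchestrates, MIC is the coprocessor
reverse = pcie_bytes(workload, home="mic")       # MIC orchestrates, CPU is the coprocessor

print("conventional offload:", conventional, "bytes over PCIe")
print("reverse offload:     ", reverse, "bytes over PCIe")
```

Under these assumptions the reverse arrangement moves less data whenever the accelerable kernels dominate the working set. A real application would also have to account for data reuse across kernels, overlap of transfers with computation, and the smaller memory capacity of the coprocessor, none of which this sketch models.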
