Communication Models for Distributed Intel Xeon Phi Coprocessors

The emergence of accelerator technology in current supercomputing systems is changing the landscape of supercomputing architectures. Accelerators such as GPGPUs and coprocessors are optimized for parallel computation while being more energy efficient than general-purpose CPUs. Their computational power per watt plays a crucial role in the development of exaflop systems. However, today's accelerators come with limitations: they require a local host to configure and operate them, the number of host CPUs and accelerators cannot be scaled independently, and communication between distributed accelerators is unbalanced. New communication frameworks are being developed to optimize internode communication. In this paper, four communication models using the Intel Xeon Phi coprocessor are compared. The Intel Xeon Phi coprocessor is based on the Intel Many Integrated Core (MIC) architecture. It is an attractive accelerator due to its embedded Linux operating system, up to 1 TFLOPS of performance on a single chip, and its x86_64 compatibility. DCFA-MPI, MVAPICH2-MIC, and HAM-Offload are compared against the communication architecture for network-attached accelerators (NAA). Each communication model optimizes a different layer of the MIC communication architecture. The NAA approach makes the accelerator device independent of a local host system and enables the accelerator to source and sink network traffic. Workloads can be assigned dynamically at runtime in an N:M ratio between CPUs and accelerators. The latency, bandwidth, and performance of the MPI communication layer of a prototype implementation are evaluated.
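The MPI-layer evaluation mentioned above boils down to standard point-to-point latency and bandwidth measurements between a host CPU and a coprocessor. The following is a minimal ping-pong microbenchmark sketch in C with MPI, given only to illustrate what such a measurement looks like; it is not the benchmark used in the paper, and the message sizes and iteration count are illustrative assumptions.

```c
/*
 * Minimal MPI ping-pong sketch: measures point-to-point latency and
 * bandwidth between two ranks (e.g., a host CPU and a network-attached
 * Xeon Phi). Message sizes and iteration counts are illustrative only.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    const int iterations = 1000;        /* assumed repetition count */
    const size_t max_bytes = 1 << 22;   /* up to 4 MiB messages     */
    char *buf = malloc(max_bytes);

    for (size_t bytes = 1; bytes <= max_bytes; bytes *= 2) {
        MPI_Barrier(MPI_COMM_WORLD);
        double start = MPI_Wtime();

        /* Rank 0 sends and waits for the echo; rank 1 echoes back. */
        for (int i = 0; i < iterations; i++) {
            if (rank == 0) {
                MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double elapsed = MPI_Wtime() - start;
        if (rank == 0) {
            /* One-way latency: half a round trip; bandwidth: bytes moved
             * in both directions divided by the elapsed time. */
            double latency_us   = elapsed / (2.0 * iterations) * 1e6;
            double bandwidth_mb = (2.0 * iterations * bytes) / elapsed / 1e6;
            printf("%8zu bytes  %10.2f us  %10.2f MB/s\n",
                   bytes, latency_us, bandwidth_mb);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

With a symmetric launch, the same binary can be started on a host and on a coprocessor, for example (hostnames are placeholders): mpirun -n 1 -host node0 ./pingpong : -n 1 -host mic0 ./pingpong.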

[1] Xing Cai, et al. Communication-hiding programming for clusters with multi-coprocessor nodes, 2015, Concurr. Comput. Pract. Exp.

[2] Holger Fröning, et al. GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters, 2013, 2013 IEEE International Conference on Cluster Computing (CLUSTER).

[3] Matthias Noack. HAM - Heterogenous Active Messages for Efficient Offloading on the Intel Xeon Phi, 2014.

[4] Dhabaleswar K. Panda, et al. MVAPICH2-MIC: A High Performance MPI Library for Xeon Phi Clusters with InfiniBand, 2013, 2013 Extreme Scaling Workshop (XSW 2013).

[5] Holger Fröning, et al. On Achieving High Message Rates, 2013, 2013 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing.

[6] Thomas Steinke, et al. A Unified Programming Model for Intra- and Inter-Node Offloading on Xeon Phi Clusters, 2014, SC14: International Conference for High Performance Computing, Networking, Storage and Analysis.

[7] G. M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities, 1967, AFIPS '67 (Spring).

[8] Thomas Lippert, et al. The DEEP Project - Pursuing Cluster-Computing in the Many-Core Era, 2013, 2013 42nd International Conference on Parallel Processing.

[9] Yutaka Ishikawa, et al. Direct MPI Library for Intel Xeon Phi Co-Processors, 2013, 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum.

[10] Yutaka Ishikawa, et al. Design of Direct Communication Facility for Many-Core Based Accelerators, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum.

[11] John D. Owens, et al. Message passing on data-parallel architectures, 2009, 2009 IEEE International Symposium on Parallel & Distributed Processing.

[12] Dhabaleswar K. Panda, et al. MVAPICH-PRISM: A proxy-based communication framework using InfiniBand and SCIF for Intel MIC clusters, 2013, 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC).

[13] Sayantan Sur, et al. MVAPICH2-GPU: optimized GPU to GPU communication for InfiniBand clusters, 2011, Computer Science - Research and Development.

[14] Pavan Balaji, et al. MT-MPI: multithreaded MPI for many-core environments, 2014, ICS '14.

[15] Holger Fröning, et al. Efficient hardware support for the Partitioned Global Address Space, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum (IPDPSW).

[16] Ulrich Brüning, et al. Scalable communication architecture for network-attached accelerators, 2015, 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA).