MPI and UPC broadcast, scatter and gather algorithms in Xeon Phi

Accelerators have revolutionised the high-performance computing (HPC) community. Despite their advantages, their very specific programming models and limited communication capabilities have kept them in a supporting role to the main processors. With the introduction of the Xeon Phi this is no longer the case, as it can be programmed as a main processor and has direct access to the InfiniBand network adapter. Collective operations play a key role in many HPC applications, so studying their behaviour on manycore coprocessors is of great importance. This work analyses the performance of different broadcast, scatter and gather algorithms on a large-scale Xeon Phi supercomputer. The algorithms evaluated are those available in the reference message passing interface (MPI) implementation for Xeon Phi (Intel MPI), the default algorithm in an optimised MPI implementation (MVAPICH2-MIC), and a new set of algorithms, developed by the authors of this work, designed with modern processors and new communication features in mind. The latter are implemented in Unified Parallel C (UPC), a partitioned global address space language, and leverage one-sided communications, hierarchical trees and message pipelining. The experiments are scaled up to 15360 cores on the Stampede supercomputer, and the results are compared with Xeon and hybrid Xeon + Xeon Phi experiments with up to 19456 cores.
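To make the design ingredients of the UPC collectives more concrete, the fragment below is a minimal sketch, assuming a directory-of-pointers layout, of a pipelined broadcast built on UPC one-sided puts. It is not the authors' implementation: the names dir, LEN, CHUNK and pipelined_bcast are illustrative assumptions, and the flat root-to-all structure stands in for the hierarchical trees described in the abstract.

/* Minimal sketch (not the paper's implementation): pipelined broadcast
 * with UPC one-sided puts.  dir, LEN, CHUNK and pipelined_bcast are
 * assumed names; a flat tree replaces the paper's hierarchical trees. */
#include <upc.h>
#include <stddef.h>

#define LEN   (1 << 20)      /* bytes broadcast to every thread (assumed) */
#define CHUNK (64 * 1024)    /* pipeline segment size (assumed) */

/* Each thread publishes a pointer to a LEN-byte buffer allocated with
 * affinity to itself, so the root can write into it remotely. */
shared [] char * shared dir[THREADS];

void pipelined_bcast(void)
{
    /* Every thread allocates its destination buffer in shared space. */
    dir[MYTHREAD] = (shared [] char *) upc_alloc(LEN);
    upc_barrier;                       /* all pointers are now visible */

    if (MYTHREAD == 0) {
        char *src = (char *) dir[0];   /* root's private view of its buffer */
        /* ... root fills src with the payload here ... */

        /* Stream the payload segment by segment: each CHUNK is pushed to
         * every destination with a one-sided put, so transfers to
         * different destinations overlap instead of each destination
         * waiting for one large message. */
        for (size_t off = 0; off < LEN; off += CHUNK) {
            size_t n = (LEN - off < CHUNK) ? (LEN - off) : (size_t)CHUNK;
            for (int t = 1; t < THREADS; t++)
                upc_memput(dir[t] + off, src + off, n);
        }
    }
    upc_barrier;   /* broadcast complete on every thread */
}

A hierarchical variant along the lines the abstract describes would push segments only to one leader thread per node and let each leader forward them to its local threads over shared memory, keeping the same one-sided, pipelined transfer pattern at each level.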
