Scalable PGAS collective operations in NUMA clusters

The increasing number of cores per processor is making manycore-based systems pervasive. This involves dealing with multiple memory levels in non-uniform memory access (NUMA) systems and with hierarchies of processor cores, accessed via complex interconnects, in order to deliver the increasing amount of data required by the processing elements. The key to efficient and scalable data provision is the use of collective communication operations that minimize the impact of bottlenecks. Leveraging one-sided communication becomes increasingly important in these systems, as it avoids the unnecessary synchronization between pairs of processes that arises when collective operations are implemented in terms of two-sided point-to-point functions. This work proposes a series of algorithms that provide good performance and scalability in collective operations, based on hierarchical trees, overlapping one-sided communications, message pipelining and the available NUMA binding features. An implementation has been developed for Unified Parallel C, a Partitioned Global Address Space language that presents a shared-memory view across nodes for programmability while keeping private memory regions for performance. The performance evaluation of the proposed implementation, conducted on five representative systems (JuRoPA, JUDGE, Finis Terrae, SVG and Superdome), has shown generally good performance and scalability, even outperforming MPI in some cases, which confirms the suitability of the developed algorithms for manycore architectures.
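
To illustrate the kind of collective the abstract describes, the sketch below shows a binomial-tree broadcast built on UPC one-sided upc_memput(), with the payload split into chunks to mimic message pipelining. The function name tree_bcast, the per-thread buffer directory dir, the CHUNK size and the coarse per-round upc_barrier are illustrative assumptions, not the paper's actual implementation, which additionally uses NUMA-aware hierarchical trees, communication overlap and thread binding.

```c
/* Hedged sketch: binomial-tree broadcast over UPC one-sided puts,
 * with chunked (pipelined) transfers.  Synchronization is simplified
 * to one upc_barrier per tree level. */
#include <upc.h>
#include <string.h>

#define CHUNK 4096                       /* pipelining granularity (assumed) */

/* Directory of per-thread receive buffers in the shared space. */
shared [] char * shared dir[THREADS];

void tree_bcast(char *data, size_t len, int root)
{
    /* Publish a receive buffer with affinity to this thread. */
    dir[MYTHREAD] = (shared [] char *) upc_alloc(len);
    char *mine = (char *) dir[MYTHREAD]; /* local view of own buffer */
    if (MYTHREAD == root)
        memcpy(mine, data, len);
    upc_barrier;

    int me = (MYTHREAD - root + THREADS) % THREADS;  /* rank relative to root */
    for (int mask = 1; mask < THREADS; mask <<= 1) {
        if (me < mask && me + mask < THREADS) {
            int dest = (me + mask + root) % THREADS;
            /* One-sided push of the payload, chunk by chunk. */
            for (size_t off = 0; off < len; off += CHUNK) {
                size_t n = (len - off < CHUNK) ? (len - off) : (size_t) CHUNK;
                upc_memput(dir[dest] + off, mine + off, n);
            }
        }
        upc_barrier;                     /* coarse per-level synchronization */
    }

    if (MYTHREAD != root)
        memcpy(data, mine, len);
    upc_barrier;
    upc_free(dir[MYTHREAD]);
}
```

In the per-round barrier above, every thread waits for the whole tree level to finish; the paper's point is precisely that one-sided puts allow replacing such global synchronization with finer-grained, overlapped signalling between NUMA-aware subtrees.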
