Kernel-assisted and topology-aware MPI collective communications on multicore/many-core platforms

Abstract: Multicore clusters, which have become the most prominent form of High Performance Computing (HPC) systems, challenge the performance of MPI applications with non-uniform memory accesses and shared cache hierarchies. Recent advances in MPI collective communications have alleviated the performance issues exposed by deep memory hierarchies by carefully mapping the collective topology onto the hardware topology and by using single-copy, kernel-assisted mechanisms. However, in distributed environments, a single-level approach cannot encompass the extreme variations not only in bandwidth and latency, but also in the ability to support duplex communication or to run multiple concurrent copies. This calls for a collaborative approach between multiple layers of collective algorithms, dedicated to extracting the maximum degree of parallelism from the collective algorithm by consolidating intra- and inter-node communications. In this work, we present HierKNEM, a kernel-assisted, topology-aware collective framework, and the mechanisms it deploys to orchestrate the collaboration between multiple layers of collective algorithms. The resulting scheme maximizes the overlap of intra- and inter-node communications. We demonstrate experimentally, on three of the most widely used collective operations (Broadcast, Allgather and Reduction), that (1) this approach is immune to changes in the underlying process-to-core binding; (2) it outperforms state-of-the-art MPI libraries (Open MPI, MPICH2 and MVAPICH2), demonstrating up to a 30x speedup on synthetic benchmarks and up to a 3x acceleration of a parallel graph application (ASP); and (3) it exhibits linear speedup as the number of cores per compute node increases, a paramount requirement for scalability on future many-core hardware.
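
To make the hierarchical decomposition concrete, the sketch below shows a leader-based broadcast built only on standard MPI-3 calls (MPI_Comm_split_type, MPI_Comm_split, MPI_Bcast). It is an illustrative assumption, not HierKNEM's implementation: the framework additionally pipelines the two levels and performs the intra-node step with single-copy, kernel-assisted (KNEM) transfers. The helper name hier_bcast and the root-at-world-rank-0 restriction are ours.

```c
/* Minimal sketch of a leader-based, topology-aware broadcast.
 * Assumes an MPI-3 library and that the global root is world rank 0. */
#include <mpi.h>
#include <stdio.h>

static void hier_bcast(void *buf, int count, MPI_Datatype type, MPI_Comm comm)
{
    MPI_Comm node_comm, leader_comm;
    int world_rank, node_rank;

    MPI_Comm_rank(comm, &world_rank);

    /* Group the processes that share a node (shared-memory domain). */
    MPI_Comm_split_type(comm, MPI_COMM_TYPE_SHARED, world_rank,
                        MPI_INFO_NULL, &node_comm);
    MPI_Comm_rank(node_comm, &node_rank);

    /* One leader per node (its lowest world rank) joins the inter-node
     * communicator; all other processes get MPI_COMM_NULL. */
    MPI_Comm_split(comm, node_rank == 0 ? 0 : MPI_UNDEFINED,
                   world_rank, &leader_comm);

    /* Step 1: leaders exchange the data across nodes. */
    if (leader_comm != MPI_COMM_NULL)
        MPI_Bcast(buf, count, type, 0, leader_comm);

    /* Step 2: each leader forwards the data inside its node
     * (HierKNEM would overlap this with the inter-node pipeline
     * and use single-copy KNEM transfers instead of MPI_Bcast). */
    MPI_Bcast(buf, count, type, 0, node_comm);

    if (leader_comm != MPI_COMM_NULL)
        MPI_Comm_free(&leader_comm);
    MPI_Comm_free(&node_comm);
}

int main(int argc, char **argv)
{
    int data = 0, rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) data = 42;             /* root fills the buffer */
    hier_bcast(&data, 1, MPI_INT, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, data);
    MPI_Finalize();
    return 0;
}
```

Because the intra-node step involves only processes that share memory, a kernel-assisted copy mechanism can replace the node-local broadcast, and overlapping it with the inter-node exchange is what provides the intra-/inter-node consolidation described above.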
