Optimizing MPI communication within large multicore nodes with kernel assistance

As the number of cores per node increases in modern clusters, the efficiency of intra-node communication becomes critical to application performance. We present a comparative study of the traditional double-copy model in MPICH2 and a kernel-assisted single-copy strategy based on KNEM, carried out on shared-memory hosts with up to 96 cores. We show that KNEM is less sensitive to process placement on these complex architectures: it improves throughput by up to a factor of 2 for large messages, for both point-to-point and collective operations, and significantly reduces the execution time of the NAS Parallel Benchmarks (NPB). We also detail when to switch from one strategy to the other depending on the communication pattern, and we show that I/OAT copy offload only remains an interesting solution on older architectures.
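
As an illustration of the kind of measurement such a comparison relies on, below is a minimal intra-node ping-pong micro-benchmark sketch. It uses only standard MPI calls and is not the benchmark from the study; the 4 MB message size and iteration count are arbitrary assumptions chosen to land in the large-message regime where the single-copy strategy is expected to pay off.

```c
/* Minimal intra-node ping-pong sketch (not the paper's benchmark).
 * Run with two ranks on the same node, e.g.:
 *   mpicc pingpong.c -o pingpong && mpiexec -n 2 ./pingpong
 * Message size and iteration count are arbitrary assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int size  = 4 << 20;   /* 4 MB: a "large message" where single-copy helps */
    const int iters = 100;
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc((size_t)size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        /* Each iteration moves 2 * size bytes (round trip). */
        double gbytes_per_s = (2.0 * size * iters) / elapsed / 1e9;
        printf("message size %d bytes: %.2f GB/s\n", size, gbytes_per_s);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Pinning the two ranks to cores that share (or do not share) a cache, for instance with the MPI launcher's binding options, is what exposes the process-placement sensitivity discussed above.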
