Optimizing MPI communication within large multicore nodes with kernel assistance

As the number of cores per node increases in modern clusters, the efficiency of intra-node communication becomes critical to application performance. We present a comparative study of the traditional double-copy model in MPICH2 and a kernel-assisted single-copy strategy based on KNEM, carried out on shared-memory hosts with up to 96 cores. We show that KNEM is less sensitive to process placement on these complex architectures: it improves throughput by up to a factor of 2 for large messages, for both point-to-point and collective operations, and significantly reduces the execution time of the NAS Parallel Benchmarks (NPB). We also detail when to switch from one strategy to the other depending on the communication pattern, and we show that I/OAT copy offload only remains an interesting solution on older architectures.
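
As an illustration of the kind of measurement such a comparison relies on, below is a minimal intra-node ping-pong micro-benchmark sketch. It uses only standard MPI calls and is not the benchmark from the study; the 4 MB message size and iteration count are arbitrary assumptions chosen to land in the large-message regime where the single-copy strategy is expected to pay off.

```c
/* Minimal intra-node ping-pong sketch (not the paper's benchmark).
 * Run with two ranks on the same node, e.g.:
 *   mpicc pingpong.c -o pingpong && mpiexec -n 2 ./pingpong
 * Message size and iteration count are arbitrary assumptions. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int size  = 4 << 20;   /* 4 MB: a "large message" where single-copy helps */
    const int iters = 100;
    int rank;
    char *buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    buf = malloc((size_t)size);

    MPI_Barrier(MPI_COMM_WORLD);
    double start = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double elapsed = MPI_Wtime() - start;

    if (rank == 0) {
        /* Each iteration moves 2 * size bytes (round trip). */
        double gbytes_per_s = (2.0 * size * iters) / elapsed / 1e9;
        printf("message size %d bytes: %.2f GB/s\n", size, gbytes_per_s);
    }

    free(buf);
    MPI_Finalize();
    return 0;
}
```

Pinning the two ranks to cores that share (or do not share) a cache, for instance with the MPI launcher's binding options, is what exposes the process-placement sensitivity discussed above.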
