pioman: A Pthread-Based Multithreaded Communication Engine

Recent cluster architectures include dozens of cores per node, with all cores sharing the network resources. To program such architectures, hybrid models mixing MPI and threads, in particular MPI+OpenMP, are gaining popularity. This imposes new requirements on communication libraries, such as support for the MPI_THREAD_MULTIPLE level of multithreading. Moreover, the high number of cores brings new opportunities to parallelize communication libraries, so as to obtain proper background progression of communication and communication/computation overlap. In this paper, we present pioman, a generic framework intended for use by MPI implementations, which provides seamless asynchronous progression of communication by opportunistically using available cores. It uses system threads and is therefore composable with any runtime system used for multithreading. Through various benchmarks, we demonstrate that our pioman-based MPI implementation exhibits very good properties regarding overlap, progression, and multithreading, and outperforms state-of-the-art MPI implementations.
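To make the idea of background progression on a system thread concrete, the following is a minimal sketch and not pioman's actual code or API. It assumes a hypothetical `fake_request` that completes after a fixed number of polls, standing in for a real network request; the names `progress_loop` and `overlap_demo` are likewise illustrative. A dedicated pthread polls the request while the calling thread keeps computing, which is the overlap pattern the abstract describes.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a pending network request: it completes after a
 * fixed number of polls. A real engine would poll the network driver. */
typedef struct {
    atomic_int polls_left;  /* polls remaining before completion */
    atomic_bool done;       /* set by the progress thread on completion */
} fake_request;

/* Progress thread body: keep polling until the request completes. */
static void *progress_loop(void *arg)
{
    fake_request *req = arg;
    while (!atomic_load(&req->done)) {
        /* atomic_fetch_sub returns the value held before the decrement. */
        if (atomic_fetch_sub(&req->polls_left, 1) <= 1)
            atomic_store(&req->done, true);
    }
    return NULL;
}

/* Start a simulated communication, compute 1 + 2 + ... + n on the calling
 * thread while the progress thread polls, then wait for completion.
 * Returns the sum, or -1 if the progress thread could not be created. */
static long overlap_demo(int polls_needed, int n)
{
    fake_request req;
    atomic_init(&req.polls_left, polls_needed);
    atomic_init(&req.done, false);

    pthread_t tid;
    if (pthread_create(&tid, NULL, progress_loop, &req) != 0)
        return -1;

    long sum = 0;               /* computation overlapped with polling */
    for (int i = 1; i <= n; i++)
        sum += i;

    pthread_join(tid, NULL);    /* plays the role of a wait on the request */
    assert(atomic_load(&req.done));
    return sum;
}
```

Calling `overlap_demo(1000, 100)` from a `main` function runs the computation and the polling concurrently; compile with `-pthread`. The sketch dedicates one thread to a single request, whereas the abstract describes using available cores opportunistically, so it illustrates only the overlap pattern and not the scheduling policy.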
