MPI Overlap: Benchmark and Analysis

In HPC applications, one of the major overheads compared to sequential code is communication cost. Application programmers often amortize this cost by overlapping communication with computation: they post a non-blocking MPI request, perform computation, and then wait for communication completion, assuming the MPI library will make the communication progress in the background. In this paper, we measure what really happens when trying to overlap non-blocking point-to-point communications with computation. We explain how background progression works, describe relevant test cases, and identify the challenges of designing such a benchmark; we then propose a benchmark suite that measures how much overlap occurs in various cases. We present overlap benchmark results for a wide panel of MPI libraries and hardware platforms. Finally, we classify, analyze, and explain the results using low-level traces that reveal the internal behavior of the MPI library.
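
The overlap pattern studied here follows the classic post/compute/wait structure. The sketch below illustrates it in C, assuming the MPI library progresses the communication in the background; the buffer size and the dummy compute() kernel are illustrative placeholders, not taken from the paper's benchmark.

```c
/* Minimal sketch of the overlap pattern: post a non-blocking request,
 * compute, then wait for completion. Illustrative only. */
#include <mpi.h>
#include <stdlib.h>

#define N (1 << 20)

/* placeholder computation standing in for the application kernel */
static double compute(const double *work, int n)
{
    double acc = 0.0;
    for (int i = 0; i < n; i++)
        acc += work[i] * work[i];
    return acc;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double *buf  = calloc(N, sizeof(double));
    double *work = calloc(N, sizeof(double));
    MPI_Request req = MPI_REQUEST_NULL;

    if (rank == 0) {
        /* post the non-blocking send ... */
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        /* ... or the matching non-blocking receive */
        MPI_Irecv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);
    }

    /* overlap window: communication is assumed to progress in the
     * background while this computation runs */
    double r = compute(work, N);
    (void)r;

    /* wait for the communication to complete */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(buf);
    free(work);
    MPI_Finalize();
    return 0;
}
```

Whether the transfer actually advances while compute() runs, rather than only once MPI_Wait is reached, is precisely what the proposed benchmark suite measures.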
