Optimizing non-blocking collective operations for InfiniBand

Non-blocking collective operations have recently been shown to be a promising complementary approach for overlapping communication and computation in parallel applications. However, to maximize the performance and usability of these operations, it is important that they progress concurrently with the application without introducing CPU overhead and without requiring explicit user intervention. While studying non-blocking collective operations in the context of our portable library (libNBC), we found that most MPI implementations do not sufficiently support overlap over the InfiniBand network. To address this issue, we developed a low-level communication layer for libNBC based on the OpenFabrics InfiniBand verbs API. With this layer we are able to achieve high degrees of overlap without the need to explicitly progress the communication operations. We show that the communication overhead of parallel application kernels can be reduced by up to 92% while not requiring user intervention to make progress.
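
To make the overlap pattern concrete, the following is a minimal C sketch of how an application initiates a non-blocking broadcast, computes independently, and only then waits for completion. It assumes the libNBC-style interface (NBC_Ibcast, NBC_Wait, NBC_Handle) described in the companion LibNBC papers; the header name and the do_independent_work helper are illustrative, and exact signatures may differ between library versions.

```c
#include <mpi.h>
#include <nbc.h>   /* libNBC header; name assumed for illustration */

/* Placeholder for computation that does not depend on buf. */
static void do_independent_work(void) { /* ... */ }

void overlapped_bcast(double *buf, int count, MPI_Comm comm)
{
    NBC_Handle handle;

    /* Initiate the collective; the call returns immediately. */
    NBC_Ibcast(buf, count, MPI_DOUBLE, 0, comm, &handle);

    /* Compute while the broadcast proceeds in the background.
     * With the verbs-based transport described in the paper,
     * no explicit test/progress calls are needed here. */
    do_independent_work();

    /* Block until the broadcast completes before using buf. */
    NBC_Wait(&handle);
}
```

The key point of the design is the middle phase: with MPI implementations that progress messages only inside MPI calls, the application would have to interleave explicit test calls with its computation to obtain overlap, whereas the verbs-based layer progresses the operation asynchronously.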
