Leveraging non-blocking collective communication in high-performance applications

Although overlapping communication with computation is an important mechanism for achieving high performance in parallel programs, developing applications that actually achieve good overlap can be difficult. Existing approaches typically rely on manual or compiler-based transformations. This paper presents a pattern- and library-based approach to optimizing collective communication in parallel high-performance applications, based on using non-blocking collective operations to overlap communication with computation. Common communication and computation patterns in iterative SPMD computations motivate the transformations we present. Our approach lets the programmer optimize communication and computation separately, while automating their interaction to achieve maximum overlap. Performance results with a model application show more than a 90% decrease in communication overhead, resulting in a 21% overall performance improvement.
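The core idea is the initiate/wait split of a non-blocking collective (e.g. MPI-3's MPI_Iallreduce followed by MPI_Wait): communication is started, independent computation proceeds, and the collective is completed only when its result is needed. As a minimal, self-contained sketch of that pattern (a background thread stands in for the communication library; `nb_allreduce` is a hypothetical name, not an MPI routine):

```python
import threading

def nb_allreduce(contributions):
    """Mock non-blocking all-reduce: start the 'communication' in a
    background thread and return a wait() handle, mimicking the
    initiate/wait split of MPI_Iallreduce + MPI_Wait."""
    result = {}

    def communicate():
        result["sum"] = sum(contributions)  # the collective's reduction

    t = threading.Thread(target=communicate)
    t.start()

    def wait():
        t.join()                 # completion point of the collective
        return result["sum"]

    return wait

# Initiate the collective, overlap it with independent local work,
# then complete it where the result is first needed.
wait = nb_allreduce([1, 2, 3, 4])        # communication starts
local = sum(i * i for i in range(1000))  # independent computation overlaps
total = wait()                           # collective completes here
```

The separation of initiation and completion is what the library-based transformations exploit: any computation that does not depend on the collective's result can be scheduled between the two calls.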
