Maximizing Communication–Computation Overlap Through Automatic Parallelization and Run-time Tuning of Non-blocking Collective Operations

Non-blocking collective communication operations extend the concept of collective operations with the added benefit of being able to overlap communication and computation. They are often considered key building blocks for scaling applications to very large process counts. Yet, using non-blocking collective operations in real-world applications is non-trivial: application codes often have to be restructured significantly in order to maximize the communication–computation overlap. This paper presents an approach to maximizing the communication–computation overlap for hybrid OpenMP/MPI applications. The work leverages automatic parallelization by extending an existing tool with the ability to utilize non-blocking collective operations. It further integrates run-time auto-tuning of non-blocking collective operations, optimizing both the algorithms used for the non-blocking collective operations and the location and frequency of the accompanying progress function calls. Four application benchmarks were used to demonstrate the efficiency and versatility of the approach on two different platforms. The results indicate significant performance improvements in virtually all test scenarios. The resulting parallel applications achieved a performance improvement of up to 43% compared to the version using blocking communication operations, and up to 95% of the maximum theoretical communication–computation overlap identified for each scenario.
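
The following is a minimal sketch, not taken from the paper, of the pattern the abstract describes: a non-blocking collective is started, independent computation proceeds in chunks, and periodic progress calls (here MPI_Test) drive the collective forward until completion. The chunk count and test frequency are hypothetical stand-ins for the parameters the run-time tuning would select, and the hybrid OpenMP layer is omitted for brevity.

```c
/* Sketch: overlap an MPI_Iallreduce with independent computation,
 * driving message progression with periodic MPI_Test calls.
 * CHUNKS and TEST_EVERY are illustrative tuning parameters. */
#include <mpi.h>
#include <stdio.h>

#define N 1048576
#define CHUNKS 64       /* hypothetical: number of compute chunks    */
#define TEST_EVERY 8    /* hypothetical: progress-call frequency     */

static double a[N], b[N];

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    double sum_in = 1.0, sum_out = 0.0;
    MPI_Request req;

    /* Start the non-blocking collective ... */
    MPI_Iallreduce(&sum_in, &sum_out, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    /* ... and perform independent computation while it progresses. */
    int flag = 0;
    for (int c = 0; c < CHUNKS; c++) {
        int lo = c * (N / CHUNKS), hi = lo + N / CHUNKS;
        for (int i = lo; i < hi; i++)
            a[i] = 2.0 * b[i] + a[i];

        /* Periodic progress call; its placement and frequency are
         * among the knobs the abstract says are tuned at run time. */
        if (!flag && c % TEST_EVERY == 0)
            MPI_Test(&req, &flag, MPI_STATUS_IGNORE);
    }

    /* Ensure the collective has completed before using the result. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("allreduce result: %f\n", sum_out);

    MPI_Finalize();
    return 0;
}
```

Without the interleaved MPI_Test calls, many MPI implementations make little progress on the collective until MPI_Wait, which is why the placement and frequency of progress calls matter for the achievable overlap.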
