Delta Send-Recv for Dynamic Pipelining in MPI Programs

Pipelining is necessary for efficient do-across parallelism, but it is difficult to automate because it requires send-receive analysis and loop blocking in both the sender and the receiver code, with a statically chosen blocking factor. This paper presents a new interface called delta send-recv. Through compiler and run-time support, it enables dynamic pipelining. In program code, the interface marks the related computation and communication; there is no need to restructure the computation code or to compose multiple messages. At run time, the message size is determined dynamically, and multiple pipelines are chained across all tasks that participate in the delta communication. The new system is evaluated on kernel and reduced NAS benchmarks, showing that it simplifies message-passing programming and improves program performance.
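To make the problem concrete, the following sketch shows the hand-coded pipelining that delta send-recv is designed to eliminate: the producer loop is blocked by a statically chosen factor, and each block is transferred with a nonblocking MPI call so that communication overlaps computation. The sketch uses only standard MPI; the array size, block size, and stand-in computation are illustrative assumptions, not values from the paper.

/* Hand-coded pipelining with a static blocking factor: the pattern
 * that delta send-recv automates. Standard MPI only; run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

#define N     1048576          /* total elements (assumed for illustration) */
#define BLOCK 65536            /* statically chosen blocking factor */

static double data[N];

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Producer: compute one block, then send it with a nonblocking
         * call while computing the next, overlapping send and compute. */
        MPI_Request reqs[N / BLOCK];
        for (int b = 0; b < N / BLOCK; b++) {
            for (int i = b * BLOCK; i < (b + 1) * BLOCK; i++)
                data[i] = (double)i * 0.5;          /* stand-in computation */
            MPI_Isend(&data[b * BLOCK], BLOCK, MPI_DOUBLE,
                      1, b, MPI_COMM_WORLD, &reqs[b]);
        }
        MPI_Waitall(N / BLOCK, reqs, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        /* Consumer: pre-post all block receives, then process each
         * block as soon as it arrives instead of waiting for the whole
         * message. */
        MPI_Request reqs[N / BLOCK];
        for (int b = 0; b < N / BLOCK; b++)
            MPI_Irecv(&data[b * BLOCK], BLOCK, MPI_DOUBLE,
                      0, b, MPI_COMM_WORLD, &reqs[b]);
        double sum = 0.0;
        for (int b = 0; b < N / BLOCK; b++) {
            MPI_Wait(&reqs[b], MPI_STATUS_IGNORE);
            for (int i = b * BLOCK; i < (b + 1) * BLOCK; i++)
                sum += data[i];
        }
        printf("sum = %f\n", sum);
    }

    MPI_Finalize();
    return 0;
}

Delta send-recv replaces this restructuring with markings on the send and receive buffers: the loops stay unblocked, and the runtime chooses the increment size and forms the pipeline on its own.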
