Optimization principles for collective neighborhood communications

Many scientific applications operate in a bulk-synchronous mode of iterative communication and computation steps. Even though the communication steps happen at the same logical time, important patterns such as stencil computations cannot be expressed as traditional collective communications in MPI. We demonstrate how neighborhood collective operations allow users to specify arbitrary collective communication relations at runtime and enable optimizations similar to those applied to traditional collective calls. We show a number of optimization opportunities and algorithms for different communication scenarios. We also show how users can assert constraints that provide additional optimization opportunities in a portable way. We demonstrate the utility of all described optimizations in a highly optimized implementation of neighborhood collective operations. Our communication and protocol optimizations result in a performance improvement of up to a factor of two for small stencil communications. We found that, for some patterns, our optimization heuristics automatically generate communication schedules comparable to hand-tuned collectives. With these optimizations in place, we are able to accelerate arbitrary collective communication patterns, such as regular and irregular stencils, with optimization methods for collective communications. We expect that our methods will influence the design of future MPI libraries and provide a significant performance benefit on large-scale systems.
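
As a concrete illustration of the interface this work targets, the sketch below expresses a 1-D periodic halo exchange as a neighborhood collective: the communication relation is declared once with MPI_Dist_graph_create_adjacent, so the library sees the whole pattern (and may remap ranks when reordering is allowed), and the exchange itself is a single MPI_Neighbor_alltoall call. This is a minimal example using standard MPI-3 calls, not the optimized implementation described in the paper; the 1-D decomposition and buffer contents are illustrative assumptions only.

```c
/* Minimal sketch (not the paper's implementation): a 1-D periodic stencil
 * expressed as an MPI-3 neighborhood collective. The process topology is
 * declared once, giving the MPI library the full communication relation
 * to analyze and optimize; the halo exchange is then a single call. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* left and right neighbors on a periodic 1-D domain */
    int left  = (rank - 1 + size) % size;
    int right = (rank + 1) % size;
    int sources[2]      = { left, right };   /* ranks we receive from */
    int destinations[2] = { left, right };   /* ranks we send to      */

    MPI_Comm stencil_comm;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
        2, sources, MPI_UNWEIGHTED,
        2, destinations, MPI_UNWEIGHTED,
        MPI_INFO_NULL, 1 /* allow rank reordering */, &stencil_comm);

    /* one halo value per neighbor, ordered as in the topology creation call */
    double sendbuf[2] = { rank + 0.25, rank + 0.75 };
    double recvbuf[2];

    MPI_Neighbor_alltoall(sendbuf, 1, MPI_DOUBLE,
                          recvbuf, 1, MPI_DOUBLE, stencil_comm);

    printf("rank %d received %.2f (from %d) and %.2f (from %d)\n",
           rank, recvbuf[0], left, recvbuf[1], right);

    MPI_Comm_free(&stencil_comm);
    MPI_Finalize();
    return 0;
}
```

In the same spirit, an irregular pattern (e.g., the communication graph of a sparse matrix) would only change the sources/destinations arrays passed at topology creation; the collective call and any library-side optimization remain unchanged.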
