Runtime detection and optimization of collective communication patterns

As parallelism steadily grows, remote-data access will soon dominate the execution time of large-scale applications. Many large-scale communication patterns expose significant structure that can be exploited to schedule communications accordingly. In this work, we identify concurrent communication patterns and transform them into semantically equivalent but faster communications. We present a directed acyclic graph formulation for communication schedules and concisely define their synchronization and data-movement semantics. Our dataflow solver computes an intermediate representation (IR) that is amenable to pattern detection. We demonstrate a detection algorithm for our IR that is guaranteed to detect communication kernels on subsets of the graph and to replace the detected subgraph with hardware-accelerated or hand-tuned kernels. These techniques are implemented in an open-source detection and transformation framework for optimizing communication patterns. Experiments show that our techniques can improve the performance of representative example codes by several orders of magnitude on two different systems. However, we also show that some collective detection problems on process subsets are NP-hard. The developed analysis techniques are an important first step towards automatic large-scale communication transformations, and they open several avenues for additional transformation heuristics and analyses. We expect that such communication analyses and transformations will become as natural as pattern detection, just-in-time compiler optimizations, and autotuning are today for serial codes.
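To make the abstract's core idea concrete, the sketch below illustrates the general technique it describes: model each process's point-to-point operations as nodes of a communication schedule, detect a subset of operations that together form a known collective pattern, and replace that subgraph with a single (presumably hand-tuned or hardware-accelerated) collective node. This is a minimal illustration only; the class and function names (`Op`, `Schedule`, `detect_broadcast`) and the flat-tree broadcast pattern are assumptions for exposition, not the paper's actual IR or detection algorithm.

```python
# Hypothetical sketch of collective-pattern detection on a communication
# schedule. All names here are illustrative assumptions, not the paper's IR.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Op:
    kind: str   # "send", "recv", or "bcast"
    src: int    # source rank
    dst: int    # destination rank (-1 for a collective over the subset)

@dataclass
class Schedule:
    """Global view: the operations posted by every process."""
    ops: list = field(default_factory=list)

def detect_broadcast(schedule, ranks):
    """Return the root rank if the operations over `ranks` form a flat-tree
    broadcast (one root sends to every other rank, which each receive once
    from that root); otherwise return None."""
    sends = {op for op in schedule.ops if op.kind == "send"}
    recvs = {op for op in schedule.ops if op.kind == "recv"}
    for root in ranks:
        others = [r for r in ranks if r != root]
        want_sends = {Op("send", root, r) for r in others}
        want_recvs = {Op("recv", root, r) for r in others}
        if want_sends <= sends and want_recvs <= recvs:
            return root
    return None

def replace_with_collective(schedule, ranks):
    """Rewrite a detected broadcast subgraph into one collective node."""
    root = detect_broadcast(schedule, ranks)
    if root is None:
        return schedule
    # Drop the matched point-to-point pairs, keep everything else.
    keep = [op for op in schedule.ops
            if not (op.src == root and op.dst in ranks and op.dst != root)]
    keep.append(Op("bcast", root, -1))  # hand off to a tuned kernel
    return Schedule(keep)

# Example: rank 0 sends to ranks 1..3 and they each receive -> a broadcast.
s = Schedule([Op("send", 0, r) for r in (1, 2, 3)] +
             [Op("recv", 0, r) for r in (1, 2, 3)])
s2 = replace_with_collective(s, [0, 1, 2, 3])
assert any(op.kind == "bcast" for op in s2.ops)
```

A real implementation would additionally match tree- and pipeline-shaped broadcasts, respect data dependencies between operations in the DAG, and, as the abstract notes, contend with the NP-hardness of some detection problems on process subsets.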
