MPI datatype processing using runtime compilation

Data packing before and after communication can make up as much as 90% of the communication time on modern computers. Despite MPI's well-defined datatype interface for non-contiguous data access, many codes use manual pack loops for performance reasons: programmers write access-pattern-specific pack loops (e.g., with manual unrolling) for which compilers emit optimized code. In contrast, MPI implementations in use today interpret datatypes at pack time, resulting in high overheads. In this work we explore the effectiveness of runtime compilation techniques that generate efficient, optimized pack code for MPI datatypes at commit time. Thus, none of the overhead of datatype interpretation is incurred at pack time, and pack setup is as fast as calling a function pointer. We have implemented a library called libpack that can be used to compile and (un)pack MPI datatypes. The library optimizes the datatype representation and uses the LLVM framework to produce vectorized machine code for each datatype at commit time. We show several examples of how MPI datatype pack functions benefit from runtime compilation and analyze the performance of compiled pack functions for the data access patterns found in many applications. The pack/unpack functions generated by our packing library are seven times faster than those of prevalent MPI implementations for 73% of the datatypes used in a scientific application, and in many cases they outperform manual pack loops.
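The abstract does not show libpack's interface, so the following C sketch only illustrates the underlying idea: it contrasts the standard MPI path, where the implementation re-interprets the datatype on every MPI_Pack call, with the kind of fully specialized loop a commit-time compiler can emit for the same memory layout. The chosen datatype (one column of an N x N row-major matrix, described with MPI_Type_vector) and the function names pack_with_mpi_datatype and pack_compiled_column are illustrative assumptions, not part of the paper.

    #include <mpi.h>
    #include <stddef.h>

    #define N 1024  /* matrix dimension; illustrative */

    /* Standard path: describe one column of a row-major N x N matrix of
       doubles as a strided MPI datatype and pack it with MPI_Pack.
       Interpreter-based MPI implementations walk the datatype on every
       pack call, which is the overhead this work eliminates.
       Assumes MPI_Init has already been called. */
    void pack_with_mpi_datatype(const double *matrix, void *outbuf, int outsize)
    {
        MPI_Datatype column;
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column); /* N blocks of 1 double, stride N */
        MPI_Type_commit(&column); /* commit time: where a JIT would compile the pack code */

        int position = 0;
        MPI_Pack(matrix, 1, column, outbuf, outsize, &position, MPI_COMM_WORLD);
        MPI_Type_free(&column);
    }

    /* What commit-time compilation boils down to for this datatype: a
       specialized gather loop with constant count and stride. The JIT
       emits code like this once, at commit time; packing afterwards
       costs only a call through a function pointer. */
    void pack_compiled_column(const double *matrix, double *outbuf)
    {
        for (int i = 0; i < N; i++)
            outbuf[i] = matrix[(size_t)i * N]; /* element (i, 0) of the matrix */
    }

Because the count and stride are compile-time constants in the specialized loop, a compiler (or LLVM at commit time, as in libpack) can unroll and vectorize it, which is the source of the speedups reported above.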
