Application-oriented ping-pong benchmarking: how to assess the real communication overheads

Moving data between processes is often described as one of the major bottlenecks in parallel computing, and a large body of research strives to improve communication latency and bandwidth on different networks, typically measured with ping-pong benchmarks at varying message sizes. In practice, the data to be communicated generally originates from application data structures and must be serialized before it can be sent over serial network channels. This serialization is often done by explicitly copying the data into communication buffers. The Message Passing Interface (MPI) standard defines derived datatypes to enable zero-copy formulations of non-contiguous data access patterns. Nevertheless, many applications still implement manual pack/unpack loops, partly because these loops outperform the datatype engines of some MPI implementations. MPI implementers, on the other hand, lack good benchmarks that represent important application access patterns. We demonstrate that data serialization can consume up to 80% of the total communication overhead for important applications. This indicates that much of the current research on optimizing serial network transfer times may target the smaller fraction of the communication overhead. To support the scientific community, we extracted the send/recv-buffer access patterns of a representative set of scientific applications and built a benchmark that includes both serialization and communication of application data and thus reflects the full communication overhead. It can be used like a traditional ping-pong benchmark to determine the end-to-end communication latency and bandwidth as observed by an application, and it supports serialization loops in C and Fortran as well as MPI datatypes for representative application access patterns. Our benchmark, consisting of seven micro-applications, reveals significant performance discrepancies between the MPI datatype implementations of state-of-the-art MPI libraries. The micro-applications aim to provide a standard benchmark for MPI datatype implementations to guide optimization, similarly to the established SPEC CPU and Livermore Loops benchmarks.
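
The contrast the abstract draws, manual pack/unpack loops versus zero-copy MPI derived datatypes, can be illustrated with a minimal ping-pong sketch. This is not the paper's benchmark or one of its seven micro-applications: the strided access pattern and the parameters BLOCKS, BLOCKLEN, and STRIDE are hypothetical placeholders for an application data layout, and the code assumes exactly two MPI ranks.

```c
/* Sketch: ping-pong over a strided access pattern, timed once with a
 * manual pack/unpack loop and once with an MPI derived datatype.
 * The pattern (BLOCKS blocks of BLOCKLEN doubles, stride STRIDE) is a
 * hypothetical stand-in for an application halo exchange. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define BLOCKS   1024
#define BLOCKLEN 8
#define STRIDE   64     /* in doubles; BLOCKLEN <= STRIDE */
#define ROUNDS   1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }
    int peer = 1 - rank;

    double *field = malloc((size_t)BLOCKS * STRIDE * sizeof(double));
    double *pack  = malloc((size_t)BLOCKS * BLOCKLEN * sizeof(double));
    for (int i = 0; i < BLOCKS * STRIDE; i++) field[i] = (double)i;

    /* Variant 1: explicit pack -> send contiguous buffer -> unpack. */
    double t0 = MPI_Wtime();
    for (int r = 0; r < ROUNDS; r++) {
        if (rank == 0) {
            for (int b = 0; b < BLOCKS; b++)
                for (int j = 0; j < BLOCKLEN; j++)
                    pack[b * BLOCKLEN + j] = field[b * STRIDE + j];
            MPI_Send(pack, BLOCKS * BLOCKLEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(pack, BLOCKS * BLOCKLEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(pack, BLOCKS * BLOCKLEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            for (int b = 0; b < BLOCKS; b++)
                for (int j = 0; j < BLOCKLEN; j++)
                    field[b * STRIDE + j] = pack[b * BLOCKLEN + j];
            MPI_Send(pack, BLOCKS * BLOCKLEN, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
        }
    }
    double t_manual = MPI_Wtime() - t0;

    /* Variant 2: zero-copy send/recv directly from the strided field
     * using an MPI_Type_vector derived datatype. */
    MPI_Datatype vec;
    MPI_Type_vector(BLOCKS, BLOCKLEN, STRIDE, MPI_DOUBLE, &vec);
    MPI_Type_commit(&vec);

    t0 = MPI_Wtime();
    for (int r = 0; r < ROUNDS; r++) {
        if (rank == 0) {
            MPI_Send(field, 1, vec, peer, 0, MPI_COMM_WORLD);
            MPI_Recv(field, 1, vec, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(field, 1, vec, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(field, 1, vec, peer, 0, MPI_COMM_WORLD);
        }
    }
    double t_dtype = MPI_Wtime() - t0;

    if (rank == 0)
        printf("manual pack: %.3f s   MPI datatype: %.3f s\n", t_manual, t_dtype);

    MPI_Type_free(&vec);
    free(field);
    free(pack);
    MPI_Finalize();
    return 0;
}
```

Timing both variants over many rounds shows how much of the measured "communication" time is actually spent in the pack/unpack loops, which is precisely the overhead an application-oriented ping-pong benchmark is meant to capture and a conventional contiguous-buffer ping-pong hides.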
