Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs

Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time often does not yield a satisfactory prediction of parallel runtime, because system noise, variations in communication time, and inherent load imbalance cause deviations from the expected simple lockstep pattern. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks, we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. In the case of significant communication overhead, even natural noise can push the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI+OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting.
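The propagation of a one-off delay through a bulk-synchronous program can be illustrated with a minimal toy timeline model. The sketch below (all parameters hypothetical, not taken from the paper) models P ranks that perform equal work per iteration and synchronize with nearest neighbors, as in a 1D halo exchange; a delay injected on one rank spreads outward by one neighbor per iteration, forming the idle wave described above. Real MPI timing also involves noise, communication cost, and bandwidth saturation, which this model omits.

```python
# Toy timeline model of an idle wave in a bulk-synchronous 1D chain.
# Each rank can start iteration k only after it and both nearest
# neighbors have finished iteration k-1 (nearest-neighbor sync).
# A one-off delay injected on rank 0 then propagates outward at a
# speed of one rank per iteration.
P, ITERS, WORK = 8, 6, 1.0      # ranks, iterations, work per iteration
t = [0.0] * P                   # finish time of the previous iteration
delay = {0: 3.0}                # inject a 3-unit delay on rank 0, iter 0

history = []                    # finish times per iteration, per rank
for it in range(ITERS):
    new_t = []
    for r in range(P):
        # earliest start: own and neighbors' previous finish times
        ready = max(t[max(r - 1, 0)], t[r], t[min(r + 1, P - 1)])
        extra = delay.get(r, 0.0) if it == 0 else 0.0
        new_t.append(ready + WORK + extra)
    t = new_t
    history.append(t[:])

# Ranks finishing later than the undisturbed schedule (it+1)*WORK are
# inside the wave; the front advances one rank per iteration.
for it, row in enumerate(history):
    front = [r for r in range(P) if row[r] > (it + 1) * WORK + 1e-9]
    print(f"iteration {it}: delayed ranks {front}")
```

Running this prints a delayed-rank set that grows by one rank per iteration, the kinematic signature of an idle wave traveling through the chain.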
