Identifying the Root Causes of Wait States in Large-Scale Parallel Applications

Driven by growing application requirements and accelerated by current trends in microprocessor design, the number of processor cores on modern supercomputers is increasing from generation to generation. However, load or communication imbalance prevents many codes from taking advantage of the available parallelism, because a delay on a single process may spread wait states across the entire machine. Moreover, when complex point-to-point communication patterns are employed, wait states may propagate along far-reaching cause-effect chains that are hard to track manually and that complicate an assessment of the actual costs of an imbalance. Building on earlier work by Meira Jr. et al., we present a scalable approach that identifies program wait states and attributes their costs, in terms of resource waste, to their original cause. By replaying event traces in parallel in both the forward and the backward direction, we can identify the processes and call paths responsible for the most severe imbalances, even for runs with tens of thousands of processes.
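The idea of a forward pass that finds wait states and a backward pass that charges their cost to the original cause can be illustrated with a toy model. The following is a minimal sketch, not the paper's algorithm or cost model: it assumes a hypothetical linear pipeline in which each rank computes for `work[i]` seconds, blocks in a receive from rank `i-1`, computes for `post[i]` seconds, and then sends to rank `i+1`. The function names and the counterfactual split rule (the part of a wait that would vanish if the sender had not waited itself is treated as propagated) are assumptions made for illustration.

```python
def forward_replay(work, post):
    """Forward pass over the toy pipeline: find each rank's late-sender wait."""
    n = len(work)
    wait = [0.0] * n
    send_time = [0.0] * n
    for i in range(n):
        recv_enter = work[i]
        if i == 0:
            recv_exit = recv_enter  # rank 0 has no predecessor to wait for
        else:
            recv_exit = max(recv_enter, send_time[i - 1])
        wait[i] = recv_exit - recv_enter
        send_time[i] = recv_exit + post[i]
    return wait, send_time


def backward_replay(wait):
    """Backward pass: charge every second of waiting to the rank that caused it.

    Assumed toy rule: of the wait on rank i, min(wait[i-1], wait[i]) seconds
    exist only because the sender waited too (propagation); the remainder is
    blamed on the sender's own excess computation (its delay).
    """
    n = len(wait)
    blame = [0.0] * n    # waiting time attributed to each rank's own delay
    carried = [0.0] * n  # downstream cost routed through rank i's wait state
    for i in range(n - 1, 0, -1):
        if wait[i] == 0.0:
            continue  # no wait state here, so nothing to attribute
        total = wait[i] + carried[i]
        propagated = total * min(wait[i - 1], wait[i]) / wait[i]
        carried[i - 1] += propagated        # pass on toward the root cause
        blame[i - 1] += total - propagated  # sender's own delay
    return blame


if __name__ == "__main__":
    # Hypothetical input: rank 0 is overloaded, all other ranks are balanced.
    work = [5.0, 1.0, 1.0, 1.0]
    post = [1.0, 1.0, 1.0, 1.0]
    wait, _ = forward_replay(work, post)
    blame = backward_replay(wait)
    print("wait per rank: ", wait)
    print("blame per rank:", blame)
```

In this made-up input, ranks 1 to 3 wait for 5, 6, and 7 seconds, yet the backward pass charges 15 of the 18 seconds to rank 0, which never waits itself. This is the kind of distinction between where waiting time is observed and where it originates that the abstract describes.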

[1] Martin Schulz, et al. Scalable load-balance measurement for SPMD codes, 2008, HiPC 2008.

[2] Mariacarla Calzarossa, et al. A methodology towards automatic performance analysis of parallel applications, 2004, Parallel Comput.

[3] Felix Wolf, et al. Space-efficient time-series call-path profiling of parallel applications, 2009, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis.

[4] Virgílio A. F. Almeida, et al. Using cause-effect analysis to understand the performance of distributed programs, 1998, SPDT '98.

[5] Hassan M. Jafri. Measuring causal propagation of overhead of inefficiencies in parallel applications, 2007.

[6] Jeffrey K. Hollingsworth. An online computation of critical path profiling, 1996, SPDT '96.

[7] Martin Schulz, et al. On the Performance of Transparent MPI Piggyback Messages, 2008, PVM/MPI.

[8] Adolfy Hoisie, et al. Performance Analysis of Wavefront Algorithms on Very-Large Scale Distributed Systems, 1998, Wide Area Networks and High Performance Computing.

[9] Bernd Mohr, et al. A scalable tool architecture for diagnosing wait states in massively parallel applications, 2009, Parallel Comput.

[10] Marc-André Hermanns, et al. Verifying Causality between Distant Performance Phenomena in Large-Scale MPI Applications, 2009, 2009 17th Euromicro International Conference on Parallel, Distributed and Network-based Processing.

[11] Marc-André Hermanns, et al. Performance Simulation of Non-blocking Communication in Message-Passing Applications, 2009, Euro-Par Workshops.

[12] Tomàs Margalef, et al. On-Line Performance Modeling for MPI Applications, 2008, Euro-Par.

[13] Nathan R. Tallent, et al. Scalable Identification of Load Imbalance in Parallel Executions Using Call Path Profiles, 2010, 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis.

[14] Felix Wolf, et al. Scalable timestamp synchronization for event traces of message-passing applications, 2009, Parallel Comput.

[15] M. Schulz, et al. Extracting Critical Path Graphs from MPI Applications, 2005, 2005 IEEE International Conference on Cluster Computing.

[16] Allen D. Malony, et al. Phase-Based Parallel Performance Profiling, 2005, PARCO.

[17] M. L. Norman, et al. Simulating Radiating and Magnetized Flows in Multiple Dimensions with ZEUS-MP, 2005, astro-ph/0511545.

[18] Wagner Meira, et al. Waiting time analysis and performance visualization in Carnival, 1996, SPDT '96.

[19] Bernd Mohr, et al. Performance analysis of Sweep3D on Blue Gene/P with the Scalasca toolset, 2010, 2010 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW).

[20] Felix Wolf, et al. SCALASCA Parallel Performance Analyses of SPEC MPI2007 Applications, 2008, SIPEW.

[21] Nathan R. Tallent, et al. HPCTOOLKIT: tools for performance analysis of optimized parallel programs, 2010, Concurr. Comput. Pract. Exp.

[22] Markus Geimer, et al. Identifying the Root Causes of Wait States in Large-Scale Parallel Applications, 2010, ICPP.

[23] Brian Wylie. Parallel performance measurement and analysis scaling lessons, 2012.

[24] Martin Schulz, et al. Scalable Critical-Path Based Performance Analysis, 2012, 2012 IEEE 26th International Parallel and Distributed Processing Symposium.

[25] Michael Geissler, et al. Bubble acceleration of electrons with few-cycle laser pulses, 2006.

[26] J. Meyer-ter-Vehn, et al. 3D simulations of surface harmonic generation with few-cycle laser pulses, 2007.

[27] Mary K. Vernon, et al. Predictive analysis of a wavefront application using LogGP, 1999, PPoPP '99.