Effective performance problem detection of MPI programs on MPP systems: From the global view to the details

This paper presents an automatic counter instrumentation and profiling module added to the MPI library on Cray T3E systems. A detailed summary of the hardware performance counters and the MPI calls of any MPI production program is gathered during execution and written to a special syslog file. The user can get the same information in a different file. Statistical summaries are computed weekly and monthly. The paper describes experiences with this library on the Cray T3E systems at HLRS Stuttgart and TU Dresden. It focuses on the scalability aspects of the new interface: how to deliver the right amount of performance data to the right person in time, and how to draw conclusions for the further optimization process, e.g. with the trace-based profiling tool Vampir.

Today, job accounting on MPP hardware platforms does not provide enough information about computational efficiency or about the efficiency of message passing (MPI) usage, either for users or for computing centers. No information is available on bandwidth and latency, or on integer and floating point operation rates obtained in real application runs. Therefore, users and hotline centers have no reliable information base for technical and political decisions with respect to programming and optimization investment. For a first glance at an application, the existing trace-based profiling tools are too complicated. They can be used in small test jobs only, but not in long-running production jobs. To solve this problem, the High-Performance Computing-Center (HLRS) at the University of Stuttgart has combined the method of counter-based pro