System-wide Introspection for Accurate Attribution of Performance Bottlenecks

The architecture of high-end systems is becoming ever more complex with the introduction of multi-chip, multi-core computation nodes. The behavior of shared resources such as caches, memory controllers, and communication interfaces plays an increasing role in overall system performance. Most current performance tools are oriented toward “first-person” performance measurement: they generally characterize the performance of individual threads using on-core Hardware Performance Monitoring (HPM). Off-core resources cannot be directly monitored by such tools. In contrast, RCRtoolkit is designed to focus on those shared resources by using HPM counters associated with the shared, off-core parts of the system. Its resource-centric approach to performance analysis is designed to provide real-time introspective feedback to applications and system software. This paper describes experiments with coupling RCRtoolkit to HPCToolkit, a first-person tool that uses event-based sampling of on-core events to attribute costs to application code using a hierarchical calling context model. The tools augment each other to provide new analysis capabilities and insights not available from either tool alone. The structure of RCRtoolkit is presented, as is the integration of the two tools. The capabilities of the combined tools are illustrated with several examples (Lattice QCD, Lattice-Boltzmann Magneto-hydrodynamics, FFT). In particular, the combination is used to identify memory bandwidth (as opposed to latency) and memory load balancing problems and to attribute this behavior accurately to specific program constructs.
