Performance analysis for teraflop computers: a distributed automatic approach

Performance analysis for applications on teraflop computers requires a new combination of concepts: online processing, automation, and distribution. This article presents the design of a new analysis system that searches automatically for performance problems. The search is guided by a specification of performance properties written in the APART Specification Language. The system is being implemented as a network of analysis agents arranged in a hierarchy. Higher-level agents search for global performance problems, while lower-level agents search for local performance problems. Leaf agents request and receive performance data from the monitoring library linked to the application. The online analysis also takes into account design patterns for parallel applications. These patterns make the analysis more effective and the output more application-related. The analysis is currently being implemented for the Hitachi SR8000 teraflop computer at the Leibniz-Rechenzentrum in Munich within the Peridot project.
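
The agent hierarchy described above can be illustrated with a minimal Python sketch. It is not the Peridot implementation and does not use ASL: the class names, the `monitor` callback standing in for the monitoring library, the two property names (`SyncOverhead`, `LoadImbalance`), the timing fields, and the 0.2 threshold are all assumptions chosen for illustration. Leaf agents test a local property on one process, and inner agents test a property over all processes in their subtree.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-in for the monitoring library: process id -> raw timings.
Monitor = Callable[[int], Dict[str, float]]


@dataclass
class Finding:
    scope: str       # where the property holds, e.g. "process 1" or "root"
    prop: str        # illustrative property name, not taken from ASL
    severity: float  # fraction of execution time attributed to the problem


class LeafAgent:
    """Requests data for one process and checks a local property."""

    def __init__(self, pid: int, monitor: Monitor, threshold: float = 0.2):
        self.pid, self.monitor, self.threshold = pid, monitor, threshold

    def total_times(self) -> List[float]:
        return [self.monitor(self.pid)["total_time"]]

    def search(self) -> List[Finding]:
        data = self.monitor(self.pid)
        severity = data["sync_time"] / data["total_time"]
        if severity > self.threshold:
            return [Finding(f"process {self.pid}", "SyncOverhead", severity)]
        return []


class InnerAgent:
    """Collects child findings and checks a property across its subtree."""

    def __init__(self, name: str, children: list, threshold: float = 0.2):
        self.name, self.children, self.threshold = name, children, threshold

    def total_times(self) -> List[float]:
        return [t for child in self.children for t in child.total_times()]

    def search(self) -> List[Finding]:
        findings = [f for child in self.children for f in child.search()]
        times = self.total_times()
        slowest = max(times)
        # Assumed severity measure: time the average process waits for the slowest.
        imbalance = (slowest - mean(times)) / slowest if slowest > 0 else 0.0
        if imbalance > self.threshold:
            findings.append(Finding(self.name, "LoadImbalance", imbalance))
        return findings


# Made-up measurements for four processes on two nodes.
samples = {
    0: {"total_time": 10.0, "sync_time": 1.0},
    1: {"total_time": 10.0, "sync_time": 3.0},
    2: {"total_time": 20.0, "sync_time": 1.0},
    3: {"total_time": 10.0, "sync_time": 1.0},
}
leaves = [LeafAgent(pid, lambda p: samples[p]) for pid in samples]
root = InnerAgent("root", [InnerAgent("node0", leaves[:2]),
                           InnerAgent("node1", leaves[2:])])
findings = root.search()
for f in findings:
    print(f"{f.scope}: {f.prop} (severity {f.severity:.3f})")
```

In this sketch the synchronization problem on process 1 is found locally by its leaf agent, while the load imbalance only becomes visible to agents that see several processes at once, which is the division of labor the hierarchy is meant to provide.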
