Using a virtual event space to understand parallel application communication behavior

For scientific applications run on clusters, communication performance becomes increasingly important as the number of cluster nodes increases. To understand the communication behavior, we have developed EventSpace, a configurable data collection, management, and observation system for monitoring low-level synchronization and communication events. Applications are instrumented by adding data collection code, in the form of event collectors, to an application's communication paths. When triggered, these collectors create virtual events and store them in a virtual event space. Based on the metadata describing the communication paths, virtual events can be combined to provide different views of the application's communication behavior. We used the data collected by EventSpace to do a post-mortem analysis of a wind-tunnel application, a river simulator, global clock synchronization, and a collective operation. The views allowed us to detect anomalous communication behavior, detect load balance problems, find hotspots in a collective communication structure, synchronize the Pentium timestamp counters on the cluster nodes, and analyze the accuracy of the synchronization.
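
The instrumentation idea described above can be illustrated with a minimal Python sketch. It is not the EventSpace implementation: the names `VirtualEvent`, `EventSpace`, `event_collector`, and `view`, the bounded buffer, and the collector identifiers are all assumptions made for illustration. It shows an event collector wrapped around a communication function, storing one timestamped virtual event per call, and a view that combines events from selected collectors.

```python
import time
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualEvent:
    collector_id: str  # which event collector produced the event (hypothetical naming)
    operation: str     # illustrative label such as "send" or "recv"
    start: float       # timestamp taken before the wrapped operation
    end: float         # timestamp taken after the wrapped operation


class EventSpace:
    """Bounded in-memory store (an assumption); oldest events are dropped when full."""

    def __init__(self, capacity=1024):
        self._events = deque(maxlen=capacity)

    def store(self, event):
        self._events.append(event)

    def view(self, collector_ids):
        """Combine events from the selected collectors, ordered by start time."""
        wanted = set(collector_ids)
        selected = (e for e in self._events if e.collector_id in wanted)
        return sorted(selected, key=lambda e: e.start)


def event_collector(space, collector_id, operation, clock=time.perf_counter):
    """Wrap a communication function so that each call records one virtual event."""
    def decorate(func):
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return func(*args, **kwargs)
            finally:
                # Record the event even if the wrapped operation raises.
                space.store(VirtualEvent(collector_id, operation, start, clock()))
        return wrapper
    return decorate
```

In this sketch, a view is just a filtered and sorted list; a per-path or per-node view would be obtained by passing the collector identifiers that lie on that path or node.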