EventSpace - Exposing and Observing Communication Behavior of Parallel Cluster Applications

This paper describes the motivation, design, and performance of EventSpace, a configurable data collection, management, and observation system for monitoring the low-level synchronization and communication behavior of parallel applications on clusters and multi-clusters. Event collectors detect events, create virtual events by recording timestamped data about them, and store the virtual events in a virtual event space. Event scopes provide different views of the application by combining and pre-processing the extracted virtual events. Online monitors are implemented as consumers that use one or more event scopes. Event collectors, event scopes, and the virtual event space can be configured and mapped to the available resources to improve monitoring performance or reduce perturbation. Experiments demonstrate that a wind-tunnel application instrumented with event collectors suffers insignificant slowdown due to data collection, and that monitors can reconfigure event scopes to trade off monitoring performance against perturbation.
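The pipeline described above (collector, virtual event, event space, scope, monitor) can be illustrated with a minimal single-process sketch. It is not the EventSpace implementation or its API: every class and function name below (VirtualEvent, VirtualEventSpace, EventCollector, EventScope, mean_latency) is hypothetical, the bounded per-collector buffer is an assumption, and the distribution across cluster nodes is left out entirely.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class VirtualEvent:
    """Timestamped record of one detected event (hypothetical layout)."""
    collector_id: str
    operation: str
    start: float
    end: float


class VirtualEventSpace:
    """Holds virtual events in one bounded buffer per collector (assumed)."""

    def __init__(self, capacity=1024):
        self._buffers = defaultdict(lambda: deque(maxlen=capacity))

    def store(self, event):
        self._buffers[event.collector_id].append(event)

    def read(self, collector_id):
        return list(self._buffers[collector_id])


class EventCollector:
    """Wraps a communication operation and records one event per call."""

    def __init__(self, collector_id, space, clock=time.perf_counter):
        self.collector_id = collector_id
        self.space = space
        self.clock = clock

    def wrap(self, operation, fn):
        def instrumented(*args, **kwargs):
            start = self.clock()
            try:
                return fn(*args, **kwargs)
            finally:
                # Store the event even if the wrapped call raises.
                self.space.store(
                    VirtualEvent(self.collector_id, operation, start, self.clock())
                )
        return instrumented


class EventScope:
    """Combines events from several collectors and pre-processes them."""

    def __init__(self, space, collector_ids, reduce_fn):
        self.space = space
        self.collector_ids = list(collector_ids)
        self.reduce_fn = reduce_fn

    def view(self):
        events = [e for cid in self.collector_ids for e in self.space.read(cid)]
        return self.reduce_fn(events)


def mean_latency(events):
    """Example pre-processing step: average duration of the events."""
    if not events:
        return 0.0
    return sum(e.end - e.start for e in events) / len(events)
```

A monitor in this sketch is any consumer that calls `view()` on one or more scopes. Changing which collectors a scope reads, or how large the buffers are, is the kind of reconfiguration that lets a monitor trade observation detail against perturbation of the application.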
